Introduction to Apache Solr

Posted in Uncategorized By Raj On September 25, 2016

Solr is an enterprise search platform built on top of Lucene, a Java search library. Lucene is a high-performance, scalable search library, well suited to anyone who wants to add full-text search to an application. Lucene is written entirely in Java, so it can be embedded in any application that runs on the JVM.

When should you use Lucene? Use the Lucene library directly if you want more control over the system, but expect to spend additional time developing features that Solr provides and Lucene does not. A simple example: with Lucene you can index PDF files, but you first need an open-source parser such as Apache Tika or PDFBox to extract the content of the PDF file, and then add that content to a Document object as a new text field.

Solr

Solr is an enterprise search platform, so you can use it to index content from various repositories such as flat files, a CMS, databases, etc.

How does Solr index files? What happens when we run the command below?

The above command reads all PDF files from the downloads folder and uploads them into the Solr core_directory folder so that their content can be indexed. Before indexing, Solr goes through several steps (refer to the above diagram).

Acquire Content

PDF and Word documents are binary files. In order to read the content of these binary files, Solr internally uses Apache Tika, which is available as a built-in feature of Apache Solr through its ExtractingRequestHandler.

Build Document

Apache Tika is an open-source content extraction framework built on top of open-source content extraction libraries such as PDFBox, Apache POI, etc. The ExtractingRequestHandler internally uses Apache Tika to determine the MIME type of the uploaded file. Based on the MIME type, Tika loads the respective parser; if the MIME type is application/pdf, Tika loads the PDFBox parser automatically.
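The idea of "detect the MIME type, then pick a parser" can be sketched in plain Java. This is only a toy illustration and not Tika's actual code: the class ToyParserSelector, its methods, and the lookup table are hypothetical names made up for this example, and the detection looks only at the "%PDF-" signature that PDF files start with, whereas Tika's detection is far more thorough.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Toy sketch (hypothetical class, not part of Tika or Solr).
final class ToyParserSelector {

    // Hypothetical lookup table: MIME type -> name of the parser that handles it.
    private static final Map<String, String> PARSERS = Map.of(
            "application/pdf", "PDFBox parser",
            "text/plain", "plain-text parser");

    // Guess the MIME type from the first bytes of the file ("magic bytes").
    static String detectMimeType(byte[] head) {
        int length = Math.min(head.length, 5);
        String prefix = new String(head, 0, length, StandardCharsets.US_ASCII);
        if (prefix.equals("%PDF-")) {
            return "application/pdf";
        }
        return "text/plain"; // simplistic fallback, an assumption of this sketch
    }

    // Pick the parser registered for the detected MIME type.
    static String selectParser(byte[] head) {
        return PARSERS.getOrDefault(detectMimeType(head), "no parser registered");
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        // prints "application/pdf -> PDFBox parser"
        System.out.println(detectMimeType(pdfHead) + " -> " + selectParser(pdfHead));
    }
}
```

The lookup table keeps detection and parser selection separate, so supporting another format in this sketch would mean one more detection rule and one more table entry.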

The respective parser reads the content and metadata of the uploaded file, and a SolrInputDocument is built from them.

Analyzer

Once the document is built, the analyzer examines the text of its fields and generates tokens. How the tokens are generated depends on the tokenizer you have specified in schema.xml for the particular text field.

 
The StandardTokenizerFactory splits the text of a field into tokens, treating whitespace and punctuation as delimiters. For example, the text "Hello, Solr!" yields the tokens "Hello" and "Solr".
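To make the tokenizing step concrete, here is a minimal sketch in plain Java that splits text on whitespace and punctuation. It is a simplification and not Lucene's implementation: the real StandardTokenizer follows the Unicode text segmentation rules (UAX #29) and treats cases such as apostrophes and numbers differently. The class ToyTokenizer is a hypothetical name used only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch (hypothetical class, not the Lucene StandardTokenizer).
final class ToyTokenizer {

    // Split on runs of whitespace and ASCII punctuation, dropping empty pieces.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[\\s\\p{Punct}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Solr, is, built, on, Lucene, a, search, library]
        System.out.println(tokenize("Solr is built on Lucene, a search library."));
    }
}
```

Note that the sketch does not lowercase the tokens. In Solr, lowercasing and similar steps are normally handled by filters configured after the tokenizer.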

Index Document

Apache Solr internally uses the Lucene inverted index. Most search engines use an inverted index data structure to achieve better performance. In an inverted index, each search term is associated with the IDs of the documents that contain it. In the example below, once the user issues a query, the engine looks up the query terms and retrieves the associated documents. This is an optimized way to get fast search results from the search engine.
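The term-to-document mapping can be sketched with a small in-memory map. This is a toy model of the idea only, under the assumption that terms are lowercased and split on whitespace and punctuation; Lucene's on-disk index format is much more elaborate. The class ToyInvertedIndex and its methods are hypothetical names for this example.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy sketch (hypothetical class, not Lucene's index).
final class ToyInvertedIndex {

    // term -> sorted IDs of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Index one document: record its ID under every term it contains.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[\\s\\p{Punct}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Look up a single term: one map access instead of scanning every document.
    SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "Solr is built on Lucene");
        index.add(2, "Lucene is a search library");
        index.add(3, "Solr uses Tika");
        System.out.println(index.search("lucene")); // prints [1, 2]
        System.out.println(index.search("solr"));   // prints [1, 3]
    }
}
```

A query for a term goes straight to its list of document IDs, which is why the lookup stays fast even as the number of indexed documents grows.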
