Introduction to Apache Solr

Solr is an enterprise search platform built on top of the Lucene Java search library. Lucene is a high-performance, scalable search library, and it is well suited to anyone who wants to add full-text search to an application. Lucene is written entirely in Java, so it can be embedded in any application that runs on the JVM.

When to use Lucene? Use the Lucene library directly if you need more control over the system, but expect to spend extra time developing features that Solr provides and Lucene does not. A simple example is indexing PDF files: with Lucene, you first need an open-source parser such as Apache Tika or PDFBox to extract the PDF content, and then you add that content to a Document object as a new text field.

Solr

Solr is an enterprise search platform, so you can use it to connect to different repositories such as flat files, content management systems (CMS), and databases.

How does Solr index files? What happens when we run the command below?

The above command reads all PDF files from the download folder and uploads them to the Solr core_directory folder so that their content can be indexed.

Acquire Content

PDF and Word documents are binary files. To read the content of binary files, Solr uses Apache Tika, which is built into Apache Solr and exposed through its ExtractingRequestHandler.

Build Document

Apache Tika is an open-source content extraction framework built on top of open-source content extraction libraries such as PDFBox and Apache POI. The ExtractingRequestHandler uses Tika to determine the MIME type of the uploaded file, and Tika loads the matching parser based on that type. If the MIME type is application/pdf, for example, Tika loads the PDFBox parser automatically.
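
The detect-then-dispatch idea can be sketched in a few lines of Python. This is only an illustration of the concept, not Tika's actual code: the function names, the signature checks, and the small lookup table are all assumptions made for this example, and Tika's real detection and parser registry are far more thorough.

```python
# Hypothetical lookup table (illustration only): MIME type -> parser name.
PARSERS_BY_MIME_TYPE = {
    "application/pdf": "PDFBox parser",
    "application/msword": "Apache POI parser",
}


def detect_mime_type(data: bytes) -> str:
    """Guess a MIME type from the first bytes of a file (simplified)."""
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    # OLE2 container signature. Simplification: other legacy Office
    # formats share this signature, but here it is treated as Word.
    if data.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "application/msword"
    return "application/octet-stream"


def choose_parser(data: bytes) -> str:
    """Pick the parser registered for the detected MIME type."""
    mime_type = detect_mime_type(data)
    return PARSERS_BY_MIME_TYPE.get(mime_type, "no parser registered")


print(choose_parser(b"%PDF-1.7 ..."))  # PDFBox parser
```

The point of the design is that the caller never names a parser: it hands over the bytes, and the detected type decides which parser runs.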

The respective parser reads the content and metadata of the uploaded file and builds a SolrInputDocument.

Analyzer

Once the parser has built the document, the analyzer examines the text of its fields and generates tokens. How the tokens are generated depends on the tokenizer you have specified in schema.xml for that particular text field.

The StandardTokenizerFactory splits text fields into tokens, using whitespace and punctuation as delimiters. For example, the text "Solr indexes PDF files, quickly!" is split into the tokens Solr, indexes, PDF, files, and quickly.
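
As a rough approximation of this behavior, the Python sketch below splits text on anything that is not a word character. It is a simplification for illustration: the function name `simple_tokenize` is made up for this example, and the real tokenizer follows Unicode text segmentation rules, so it treats cases such as apostrophes and decimal numbers more carefully than this regular expression does.

```python
import re


def simple_tokenize(text):
    """Split text into tokens on whitespace and punctuation (simplified)."""
    # \w+ matches runs of letters, digits, and underscores;
    # everything else acts as a delimiter and is dropped.
    return re.findall(r"\w+", text)


print(simple_tokenize("Tika extracts text; Solr indexes it."))
# ['Tika', 'extracts', 'text', 'Solr', 'indexes', 'it']
```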

Index Document

Apache Solr internally uses the Lucene inverted index. Most search engines use an inverted index data structure to achieve better performance. In an inverted index, each search term is stored together with the IDs of the documents that contain it. When the user issues a query, Solr looks up the query terms in the index and retrieves the associated documents. This avoids scanning every document and is what makes search results fast.
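
To make the data structure concrete, here is a minimal Python sketch of an inverted index. It is not how Lucene stores its index on disk; the function names, the lowercasing step, and the rule that a document must contain every query term are assumptions chosen to keep the example small.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: lowercase and split on non-word characters.
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every term in the query."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Solr is built on Lucene",
    2: "Lucene is a search library",
    3: "Solr indexes PDF files",
}
index = build_inverted_index(docs)
print(search(index, "lucene"))       # {1, 2}
print(search(index, "solr lucene"))  # {1}
```

A query only touches the entries for its own terms, so the cost depends on how many documents match those terms rather than on the total size of the collection.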

