Solr is an enterprise search platform built on top of the Lucene Java search library. Lucene is a high-performance, scalable search library, well suited to anyone who wants to add full-text search to an application. Lucene is written entirely in Java, so it can be embedded in any Java application.
When to use Lucene? If you want more control over the system, you can use the Lucene library directly, but you must spend additional time developing features that Solr provides and Lucene does not. A very simple example: with Lucene you can index PDF files, but to do so you need an open-source parser such as Apache PDFBox (used directly or through Apache Tika) to extract the content of the PDF file and add it to a Document object as a new text field.
Solr is an enterprise search platform, so you can use it to connect to various repositories such as flat files, content management systems (CMS), databases, etc.
How does Solr index files? What happens when we run the command below?
The above command reads all PDF files from the downloads folder and uploads them to the Solr core_directory folder so that their content can be indexed. Before indexing, Solr goes through several steps (refer to the diagram above).
PDF and Word documents are binary files. In order to read the content of these files, Solr internally uses Apache Tika, which is built into Apache Solr and exposed through its ExtractingRequestHandler.
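To make the role of the ExtractingRequestHandler more concrete, here is a minimal sketch of how a client could address it over HTTP. It assumes the handler is registered at its conventional `/update/extract` path; the host, the core name `mycore`, the document id, and the helper names `build_extract_url` and `post_pdf` are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(base_url, core, doc_id, commit=True):
    """Build the URL for Solr's extracting handler (assumed at /update/extract)."""
    params = {
        "literal.id": doc_id,                      # unique id stored with the document
        "commit": "true" if commit else "false",   # make the document searchable at once
    }
    return f"{base_url}/solr/{core}/update/extract?{urlencode(params)}"


def post_pdf(url, pdf_path):
    """Send the raw PDF bytes to Solr; needs a running Solr instance (assumption)."""
    with open(pdf_path, "rb") as handle:
        body = handle.read()
    request = Request(url, data=body, headers={"Content-Type": "application/pdf"})
    with urlopen(request) as response:
        return response.status


# Hypothetical host, core name and document id.
url = build_extract_url("http://localhost:8983", "mycore", "doc1")
print(url)
```

Only `build_extract_url` runs here; `post_pdf` is left uncalled because it depends on a live Solr server and a real file.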
Apache Tika is an open-source content extraction framework built on top of open-source content extraction libraries such as PDFBox, Apache POI, etc. The ExtractingRequestHandler internally uses Apache Tika to determine the MIME type of the uploaded file, and based on that MIME type Tika loads the respective parser. If the MIME type is application/pdf, Tika loads the PDFBox parser automatically.
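The "detect the MIME type, then pick a parser" idea can be sketched in a few lines of Python. This is not Tika's code: the sketch guesses the type from the file extension with the standard `mimetypes` module, whereas Tika can also inspect the file's content, and the parser labels in `PARSERS` as well as the `pick_parser` helper are hypothetical.

```python
import mimetypes

# Hypothetical registry mapping MIME types to a parser label.
PARSERS = {
    "application/pdf": "PDFBox-based parser",
    "application/msword": "Apache POI-based parser",
}


def pick_parser(filename):
    """Guess the MIME type from the file name and look up a matching parser."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    # Unknown or unregistered types fall back to a generic parser label.
    return mime_type, PARSERS.get(mime_type, "fallback parser")


print(pick_parser("report.pdf"))
```

The point of the lookup table is that adding support for a new format only means registering one more MIME type, which is the same design that lets Tika select a parser automatically.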
The respective parser reads the content and metadata of the uploaded file, which are then used to build a SolrInputDocument.
Once the document is built, the analyzer examines the text of its fields and generates tokens. How the tokens are generated depends on the tokenizer you have specified in schema.xml for the particular text field.
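To see why the choice of tokenizer matters, here is a small Python sketch of an analysis chain: a tokenizer followed by a lowercase filter. It only imitates the idea; the function names `whitespace_tokenizer`, `word_tokenizer` and `analyze` are hypothetical and are not Solr or Lucene classes.

```python
import re


def whitespace_tokenizer(text):
    """Split on whitespace only, so 'full-text' stays a single token."""
    return text.split()


def word_tokenizer(text):
    """Split on any non-word character, so 'full-text' becomes two tokens."""
    return re.findall(r"\w+", text)


def analyze(text, tokenizer):
    """Run the chosen tokenizer, then lowercase every token."""
    return [token.lower() for token in tokenizer(text)]


text = "Solr supports Full-Text search"
print(analyze(text, whitespace_tokenizer))  # ['solr', 'supports', 'full-text', 'search']
print(analyze(text, word_tokenizer))        # ['solr', 'supports', 'full', 'text', 'search']
```

With the first tokenizer a query for "text" would not match this field, while with the second it would, which is why the tokenizer configured for a field shapes what users can find.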
Apache Solr internally uses Lucene's inverted index. Most search engines use an inverted index data structure to achieve better performance. In an inverted index, each search term is stored together with the IDs of the documents that contain it. In the example below, once the user issues a query, the engine looks up the query terms and retrieves the associated documents. Because it does not have to scan every document, this is an efficient way to return search results quickly.
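The following is a minimal Python sketch of an inverted index, meant only to illustrate the data structure and not how Lucene stores it on disk. The sample documents and the helper names `build_inverted_index` and `search` are made up for the illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Made-up sample documents keyed by document id.
docs = {
    1: "Solr is built on Lucene",
    2: "Lucene is a search library",
    3: "Solr uses Apache Tika",
}
index = build_inverted_index(docs)
print(search(index, "lucene"))       # {1, 2}
print(search(index, "solr lucene"))  # {1}
```

A query only touches the entries for its own terms, so the cost of a lookup depends on the number of matching documents rather than on the size of the whole collection.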
A technology enthusiast and a professional blogger from India. Throughout my IT career, I have had the pleasure of working on various new technologies and have built products like www.ziprides.com, among others. Unfortunately, my attempts have not given me the desired results, so I have finally decided to build a professional blog where I can share everything I have learned, and I hope to learn from other enthusiasts around the world.