Solr is an enterprise search platform built on top of the Lucene Java search library. Lucene is a high-performance, scalable search library, best suited to people who want to add full-text search to their applications. Lucene is written entirely in Java, so it can be embedded in any application that runs on the JVM.
When should you use Lucene? If you need more control over the system, use the Lucene library directly, but you must spend additional time developing features that Solr provides and Lucene does not. A simple example is indexing PDF files: with Lucene, you first need an open source parser such as Apache Tika or PDFBox to extract the PDF content, and then add that content to a Document object as a new text field.
Solr is an enterprise search platform, so you can use it to index content from different repositories such as flat files, a CMS, databases, etc.
How does Solr index files? What happens when we run the command below?
java -Dc=core_directory -jar C:\Users\greyarea\Downloads\post.jar -c core_directory C:\Users\greyarea\Downloads\*.pdf
The above command reads all PDF files from the Downloads folder and uploads them to the Solr core named core_directory, which indexes their content.
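The post tool is, at heart, an HTTP client. As a rough sketch of what one such upload looks like, the Java program below posts a single PDF to the core's /update/extract endpoint. The host, port, and the choice of the file name as document id are assumptions made for illustration; literal.id and commit are request parameters of the extracting handler. It needs Java 11 or later and a running Solr instance.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

class ExtractUploadSketch {

    // Builds the extracting-handler URL for one core.
    // literal.id sets the document id; commit=true makes the document searchable right away.
    static String extractUrl(String solrBase, String core, String docId) {
        String id = URLEncoder.encode(docId, StandardCharsets.UTF_8);
        return solrBase + "/" + core + "/update/extract?literal.id=" + id + "&commit=true";
    }

    // Sends the raw file bytes as the request body and returns the HTTP status code.
    static int upload(Path pdf, String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/pdf")
                .POST(HttpRequest.BodyPublishers.ofFile(pdf))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java ExtractUploadSketch.java <file.pdf>");
            return;
        }
        Path pdf = Path.of(args[0]);
        // Assumed defaults: local Solr on port 8983, core named core_directory.
        String url = extractUrl("http://localhost:8983/solr", "core_directory",
                pdf.getFileName().toString());
        System.out.println("POST " + url + " -> HTTP " + upload(pdf, url));
    }
}
```

Using the file name as the id is only a convenience here; any value that is unique per document would do.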
PDF and Word documents are binary files. To read the content of binary files, Solr uses Apache Tika, which is built into Solr and exposed through its ExtractingRequestHandler.
Apache Tika is an open source content extraction framework built on top of open source content extraction libraries such as PDFBox, Apache POI, etc. The ExtractingRequestHandler uses Tika to determine the MIME type of the uploaded file, and based on the MIME type Tika loads the respective parser. If the MIME type is application/pdf, Tika automatically loads the PDFBox parser.
The respective parser reads the content and metadata of the uploaded file, and a SolrInputDocument is built from them.
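The detect-then-dispatch step can be pictured with a small, self-contained sketch. This is not Tika's code or API: the class, the registry, and the two detection rules are hypothetical and heavily simplified, and real detection also considers file names and looks deeper into the file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

class MimeDispatchSketch {

    // Hypothetical registry: MIME type -> library whose parser handles it.
    static final Map<String, String> PARSERS = Map.of(
            "application/pdf", "PDFBox",
            "application/msword", "Apache POI");

    // Rough detection from the first bytes of the file ("magic" bytes).
    static String detectMimeType(byte[] head) {
        if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        // OLE2 container signature, shared by legacy Office formats.
        // Treated as Word here for simplicity; real detection is finer-grained.
        if (startsWith(head, new byte[] {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0})) {
            return "application/msword";
        }
        return "application/octet-stream";
    }

    // Picks the parser registered for a MIME type, if any.
    static String selectParser(String mimeType) {
        return PARSERS.getOrDefault(mimeType, "none (unsupported type)");
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
        String mime = detectMimeType(head);
        System.out.println(mime + " -> " + selectParser(mime));
    }
}
```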
Once the document is built, the analyzer examines the text of each field and generates tokens. How the tokens are generated depends on the tokenizer you have specified in schema.xml for that text field's type.
<fieldType name="text" class="solr.TextField">
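The StandardTokenizerFactory splits text fields into tokens, using whitespace and punctuation as delimiters; the delimiters themselves are discarded. For the text "How solr works ? Solr is an enterprise search engine platform", the result looks something like this:
[How] [solr] [works] [Solr] [is] [an] [enterprise] [search] [engine] [platform]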
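To get a feel for this step, here is a minimal approximation in plain Java. It is not the StandardTokenizer itself, which follows the Unicode word-break rules and handles many cases this sketch does not (apostrophes inside words, for instance). It simply treats every run of characters that are not letters or digits as a delimiter and drops it.

```java
import java.util.ArrayList;
import java.util.List;

class TokenizeSketch {

    // Splits on runs of non-letter, non-digit characters and drops empty pieces.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "How solr works ? Solr is an enterprise search engine platform";
        System.out.println(tokenize(text));
    }
}
```

In a real analyzer chain, filters such as lowercasing usually run after the tokenizer, which is why "solr" and "Solr" are still separate tokens at this stage.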
Apache Solr internally uses Lucene's inverted index. Most search engines use an inverted index data structure to achieve better performance. In an inverted index, each search term is associated with the IDs of the documents that contain it. Once the user issues a query, the engine looks up the query terms and retrieves the associated documents. This is an optimized way to get fast search results from the search engine.
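The idea fits in a few lines of Java. The sketch below is a toy in-memory structure, not Lucene's actual index format: it maps each lowercased term to the sorted set of document IDs containing it, so a lookup goes straight to the term instead of scanning every document. The class name and the sample documents are made up for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

class InvertedIndexSketch {

    // term -> IDs of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Tokenizes the text crudely and records the document ID under every term.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the IDs of all documents containing the term (empty if none).
    SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Solr is an enterprise search platform");
        index.add(2, "Lucene is a search library");
        System.out.println("search -> " + index.search("search"));
        System.out.println("solr   -> " + index.search("solr"));
    }
}
```

The cost of a lookup depends on the number of distinct terms and matching documents, not on the total amount of text indexed, which is where the speed comes from.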