When building a custom search application with Apache Solr, you must incorporate text analysis into both the indexing of documents and the searching of the index. You never really know what a user might type into a search box in a website search application like one provided by Google Custom Search, for example. The analyzer must be equipped to take that input, create tokens and compare them to what sits inside the index. What sits inside the index came from another text stream that was itself processed by an analyzer. So the analyzer operates on both sides of the search application.
The text analysis process is very complicated when you dive into it. The finer points about stemmers, stop words, synonyms and normalization are often best left to academics in the field of natural language processing. At the same time, we need a basic understanding, which is what we will gain here.
A common first step for developers is to build a test application in Apache Solr or Elasticsearch. Both sit on top of Lucene, a library of field-processing classes built in Java, so knowledge of Lucene extends to both Solr and Elasticsearch.
It isn't until after the developer fully tests the features of the Solr and Lucene combination that a production-quality application starts to take shape. Much of that time is spent within text analyzers and the two child processes within analyzers, called tokenizers and filters.
As background, each field in Solr is assigned a fieldType, and each fieldType has its own analyzer. Each analyzer is therefore set up specifically for its fieldType, whether that be numeric, boolean or text.
The analyzer can be split into an "index" section for indexing and a "query" section for the search phase. In both situations the analyzer turns a stream of text into tokens. Beginners are advised to use the same set of analyzer steps for the "index" and "query" processes, and to add customization later, after adequate testing.
The analyzer points to a class. The pointer may sit in the class attribute of the <analyzer> tag within the <fieldType> tag itself; or, if the <analyzer> tag is opened without a class, then its subsequent child tags are identified as tokenizers and filters. In that case, the analyzer is a chain of those child processes. A one-line fieldType analyzer can be declared directly in either the schema.xml or managed-schema file, while a multi-line analyzer lists its tokenizers and filters as child tags. Note that single-line analyzer tags end with the self-closing /> and multi-line analyzers are closed with a separate </analyzer> tag.
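To make the two forms concrete, here is a sketch of each in schema.xml (or managed-schema) syntax. The fieldType names text_one_line and text_multi_line are our own, chosen for illustration; the tokenizer and filter factory classes are standard Solr classes.

```xml
<!-- One-line form: the analyzer tag carries a class and self-closes with /> -->
<fieldType name="text_one_line" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
</fieldType>

<!-- Multi-line form: no class on the analyzer tag, so the child tags
     define the chain; separate index and query sections are optional -->
<fieldType name="text_multi_line" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Keeping the index and query chains identical, as shown here, matches the advice above for beginners.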
The following is a list of common analyzers by name. See the table in section 4 for the associated class name. Notice that analyzers may have one instruction or delegate tokenizing and filtering to other child classes. See our pages on tokenizers and filters for more information (links below).
|Analyzer Name|Tokenizer, Filter(s)|
There are many other analyzers for languages besides English.
Because most of the analyzers here refer to child processes performed by tokenizers and filters, it is best to toggle between the pages to get a grasp of what each step does. What is shown within double-quotes is the token. The token is added to the index or used to retrieve information from the index at query time.
The Classic Analyzer tokenizes using the Classic Tokenizer and filters using the Classic Filter, the Lowercase Filter and the Stop Filter.
|I've gotten 5 E-mails from email@example.com.|"i", "ve", "gotten", "5", "e", "mails", "from", "email@example.com"|
Of course results from the Stop Filter will vary depending on the words located in the stopwords.txt file.
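As a sketch, the same chain can be spelled out as a multi-line analyzer in schema.xml. The fieldType name text_classic is our own, and the ignoreCase and words attributes on the Stop Filter are typical settings rather than requirements.

```xml
<fieldType name="text_classic" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.ClassicFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```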
The Keyword Analyzer tokenizes the whole text stream as one term.
|Washington DC|"Washington DC"|
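A minimal schema.xml sketch of the same behavior uses the Keyword Tokenizer with no filters; the fieldType name text_keyword is our own.

```xml
<fieldType name="text_keyword" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```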
The Simple Analyzer delegates to the Letter Tokenizer and the Lowercase Filter.
|The MLB is a long-lived monopoly. The most recent challenge went to the U.S. Supreme Court in 1972.|"the", "mlb", "is", "a", "long", "lived", "monopoly", "the", "most", "recent", "challenge", "went", "to", "the", "u", "s", "supreme", "court", "in"|
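The equivalent multi-line chain in schema.xml is a sketch like the following; the fieldType name text_simple is our own.

```xml
<fieldType name="text_simple" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.LetterTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the Letter Tokenizer keeps only runs of letters, digits such as "1972" never become tokens, which explains their absence from the output above.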
The StandardAnalyzerFactory class uses the Standard Tokenizer and three filters: the Standard Filter, the Lowercase Filter and the Stop Filter. For this example, assume the following words sit in the stopwords.txt file: "they", "it", "or", "the".
|They call it N.Y.C., NY, New York, Number-1 City or the Big Apple.|"call", "n.", "y.", "c", "ny", "new", "york", "number", "1", "city", "big", "apple"|
Apache Solr documentation indicates that the Standard Filter no longer operates when the luceneMatchVersion setting in solrconfig.xml is higher than 3.1.
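Given that the Standard Filter is a no-op on recent versions, a multi-line equivalent would typically list just the tokenizer and the two remaining filters. This is a sketch; the fieldType name text_standard is our own and the Stop Filter attributes are typical settings.

```xml
<fieldType name="text_standard" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```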
The StopAnalyzerFactory class uses the Letter Tokenizer, the Lowercase Filter and the Stop Filter. For this example, assume the following words sit in the stopwords.txt file: "the" and "a".
|The green fox jumped over a log. Wait, a green fox?|"green", "fox", "jumped", "over", "log", "wait", "green", "fox"|
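Spelled out as a multi-line chain in schema.xml, this is a sketch of the same steps; the fieldType name text_stop is our own.

```xml
<fieldType name="text_stop" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.LetterTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

The only difference from the Simple Analyzer chain is the added Stop Filter, which is why stop words drop out of the output here.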
The Whitespace Analyzer uses the Whitespace Tokenizer.
|Wouldn't you want to break this 1-sentence up differintly?|"Wouldn't", "you", "want", "to", "break", "this", "1-sentence", "up", "differintly?"|
The misspelled word in the last example, plus the unnecessary symbols created as tokens, should clarify the point that the analysis process is not, and will never be, perfect. Over time and with experience, the developer will weigh the costs of added layers of analysis against the benefits of added accuracy. This is really where the text analysis process begins to take shape.
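For completeness, the whitespace chain can be sketched in schema.xml as a tokenizer with no filters; the fieldType name text_ws is our own.

```xml
<fieldType name="text_ws" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Because nothing follows the tokenizer, punctuation, capitalization and misspellings all pass straight through to the index, as the example above shows.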
FactorPad offers Apache Solr Search content in both tutorials and reference.