Solr Analyzers Syntax and Examples | Lucene and Solr Reference

Using a Solr Analyzer for Text Analysis

Beginner

When building a custom search application with Apache Solr, you must incorporate text analysis into both the indexing of documents and the searching of the index. You never really know what a user might type into the search box of a website search application, such as one provided by Google Custom Search. The analyzer must be equipped to take the input, create tokens and compare them to what is inside the index. What is inside the index came from another text stream, which was also processed by an analyzer. So the analyzer operates on both sides of the search application.

The text analysis process is very complicated when you dive into it. The finer points about stemmers, stop words, synonyms and normalization are often best left to academics in the field of natural language processing. At the same time, we need a basic understanding, which is what we will gain here.

A common first step for developers is to build a test application in Apache Solr or Elasticsearch. Both sit on top of the Lucene library of field processing classes, which is written in Java, so knowledge of Lucene can be applied to both Solr and Elasticsearch.

It isn't until after the developer fully tests the features of the Solr and Lucene combination that a production-quality application starts to take shape. Much of that time is spent within text analyzers and their two child processes, called tokenizers and filters.

Apache Solr Reference

1. About Field Type Analyzers

As background, each field in Solr is assigned a fieldType, and each fieldType has its own analyzer. So each analyzer is set up specifically for the fieldType, whether that be numeric, boolean or text.
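As a rough illustration of that relationship, the sketch below declares one field and the fieldType it points to. The names "title" and "text_basic" are made up for this example; solr.TextField and the tokenizer and filter factories are standard Solr classes.

```xml
<!-- Hypothetical field: its "type" attribute names the fieldType below. -->
<field name="title" type="text_basic" indexed="true" stored="true"/>

<!-- Hypothetical fieldType: the nested analyzer applies to every field of this type. -->
<fieldType name="text_basic" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```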

The analyzer can be split into an "index" section for indexing and a "query" section for the search phase. In both situations the analyzer turns a stream of text into tokens. Beginners should use the same set of analyzer steps for the "index" and "query" processes and add customization later, after adequate testing.
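A split analyzer could be sketched as follows. The fieldType name "text_split" is an assumption for illustration; the type="index" and type="query" attributes are what separate the two sections. Both sections use identical steps here, in line with the advice for beginners.

```xml
<!-- Hypothetical fieldType with separate index-time and query-time analyzers. -->
<fieldType name="text_split" class="solr.TextField">
  <!-- Applied to documents as they are added to the index. -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Applied to the text a user submits at search time. -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Later customization would typically mean adding or removing a filter in only one of the two sections.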

The analyzer points to a class. The class may be named in the <analyzer> tag itself, or, if the tag carries no class, its child tags identify a tokenizer and filters. In that case, the analyzer is a multi-part process.

2. Syntax for Analyzers

The following syntax is an example of a one-line fieldType analyzer located in either schema.xml or managed-schema.
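A one-line analyzer might look like the sketch below. The fieldType name "nametext" is illustrative; the class attribute names a Lucene analyzer class by its full package path.

```xml
<!-- Hypothetical fieldType whose analyzer is a single, self-closing tag. -->
<fieldType name="nametext" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
</fieldType>
```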

Below is an example of a multi-line analyzer, including tokenizers and filters.
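A multi-line analyzer might look like the sketch below. Again the fieldType name "nametext" is illustrative, and the particular choice of filters is an assumption rather than a recommendation. The steps run in the order they are listed.

```xml
<!-- Hypothetical fieldType whose analyzer is built from a tokenizer and filters. -->
<fieldType name="nametext" class="solr.TextField">
  <analyzer>
    <!-- Exactly one tokenizer, which splits the text stream into tokens. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Zero or more filters, each receiving the output of the step before it. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
  </analyzer>
</fieldType>
```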

Note that single-line analyzer tags end with /> and multi-line analyzers are closed with a separate </analyzer> tag.

3. Options for Analyzers and Delegated Tokenizers and Filters

The following is a list of common analyzers by name. See the examples in section 4 for the associated class names. Notice that an analyzer may carry out a single instruction or delegate tokenizing and filtering to other child classes. See our pages on tokenizers and filters for more information (links below).

Analyzer Name	Tokenizer, Filter(s)
Classic Analyzer	Classic Tokenizer, Classic Filter, Lowercase Filter, Stop Filter
Keyword Analyzer	Creates a single token from the entire stream of text. Well suited for id fields and structured data.
Simple Analyzer	Letter Tokenizer, Lowercase Filter
Standard Analyzer	Standard Tokenizer, Standard Filter, Lowercase Filter, Stop Filter
Stop Analyzer	Letter Tokenizer, Lowercase Filter, Stop Filter
Whitespace Analyzer	Whitespace Tokenizer

There are many other analyzers for languages besides English.

4. Examples of Analyzers

Because most of the analyzers here refer to child processes performed by tokenizers and filters, it is best to toggle between the pages to get a grasp of what each step does. What is shown within double-quotes is the token. The token is added to the index or used to retrieve information from the index at query time.

Example 1 - Classic Analyzer

The ClassicAnalyzerFactory class tokenizes using the Classic Tokenizer and filters using the Classic Filter, Lowercase Filter and the Stop Filter.

Input	Output
I've gotten 5 E-mails from joe@example.com.	"i've", "gotten", "5", "e", "mails", "from", "joe@example.com"

Of course results from the Stop Filter will vary depending on the words located in the stopwords.txt file.
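One way to make the dependency on stopwords.txt explicit is to spell out the same steps as a tokenizer and filter chain. This is a sketch under assumptions: the fieldType name "text_classic" is hypothetical, and the words and ignoreCase attributes are set here only for illustration.

```xml
<!-- Hypothetical fieldType that spells out the Classic steps one by one. -->
<fieldType name="text_classic" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.ClassicFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Tokens listed in stopwords.txt are dropped at this step. -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Editing stopwords.txt changes which tokens survive this last step without touching the schema.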

Example 2 - Keyword Analyzer

The KeywordAnalyzerFactory class tokenizes the whole text stream as one term.

Input	Output
Washington DC	"Washington DC"
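A comparable setup built from a tokenizer could be sketched as below, assuming a hypothetical fieldType named "text_exact". The Keyword Tokenizer emits the entire input as one token and no filters follow it, so case and spacing are preserved.

```xml
<!-- Hypothetical fieldType that keeps the whole input as a single token. -->
<fieldType name="text_exact" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```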

Example 3 - Simple Analyzer

The SimpleAnalyzerFactory class delegates to the Letter Tokenizer and the Lowercase Filter.

Input	Output
The MLB is a long-lived monopoly. The most recent challenge went to the U.S. Supreme Court in 1972.	"the", "mlb", "is", "a", "long", "lived", "monopoly", "the", "most", "recent", "challenge", "went", "to", "the", "u", "s", "supreme", "court", "in"

Example 4 - Standard Analyzer

The StandardAnalyzerFactory class uses the Standard Tokenizer and three filters: Standard Filter, Lowercase Filter and Stop Filter. For this example, assume the following words sit in the stopwords.txt file: "they", "it", "or", "the".

Input	Output
They call it N.Y.C., NY, New York, Number-1 City or the Big Apple.	"call", "n.y.c", "ny", "new", "york", "number", "1", "city", "big", "apple"

Apache Solr documentation indicates that the Standard Filter no longer operates when the luceneMatchVersion setting in solrconfig.xml is higher than 3.1.
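For orientation, the setting in question is a single element in solrconfig.xml. The version number shown is only an illustration and should match the release actually in use.

```xml
<!-- solrconfig.xml: the version number here is only an illustration. -->
<luceneMatchVersion>8.11.0</luceneMatchVersion>
```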

Example 5 - Stop Analyzer

The StopAnalyzerFactory class uses the Letter Tokenizer, Lowercase Filter and the Stop Filter. For this example, assume the following words sit in the stopwords.txt file: "the", "a".

Input	Output
The green fox jumped over a log. Wait, a green fox?	"green", "fox", "jumped", "over", "log", "wait", "green", "fox"

Example 6 - Whitespace Analyzer

The WhitespaceAnalyzerFactory class uses the Whitespace Tokenizer.

Input	Output
Wouldn't you want to break this 1-sentence up differintly?	"Wouldn't", "you", "want", "to", "break", "this", "1-sentence", "up", "differintly?"

The misspelled word in the last example, plus the unnecessary symbols created as tokens, should clarify the point that the analysis process is not, and will never be, perfect. Over time and with experience, the developer will weigh the costs of added layers of analysis against the benefits of added accuracy. This is really where the text analysis process begins to take shape.
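As one illustration of an added layer, the sketch below follows the Whitespace Tokenizer with a Pattern Replace Filter that strips punctuation from the end of each token. The fieldType name "text_ws_clean" and the regular expression are assumptions chosen for this example. With this filter in place, "differintly?" would be expected to become "differintly", while the misspelling itself remains.

```xml
<!-- Hypothetical fieldType: whitespace tokenizing plus one clean-up layer. -->
<fieldType name="text_ws_clean" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Removes one or more punctuation characters at the end of a token. -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="\p{Punct}+$" replacement=""/>
  </analyzer>
</fieldType>
```

Each such layer adds processing at index and query time, which is the cost to weigh against the cleaner tokens.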
