FactorPad
Faster Learning Tutorials

Solr Analyzers : syntax, options and examples

An analyzer in Solr processes text both when documents are indexed and when queries are run, so that search terms can be matched against the terms stored in the index.
  1. About - Understand the purpose of an analyzer.
  2. Syntax - See how analyzers are coded in schema.xml or managed-schema.
  3. Options - View different classes and their delegated tokenizers and filters.
  4. Examples - Review examples of commonly-used analyzers.
by Paul Alan Davis, CFA, November 12, 2017
Updated: July 16, 2018
Here we focus on analyzers in XML format, but they can also be accessed from the Solr Admin UI or the Solr Schema API.



Using a Solr Analyzer for Text Analysis

Beginner

When building a custom search application with Apache Solr, you must incorporate text analysis into both the indexing of documents and the searching of the index. You never really know what a user might type into the search box of a website search application, such as one provided by Google Custom Search. The analyzer must be equipped to take that input, create tokens and compare them to what is inside the index. What is inside the index came from another text stream, which was also processed by an analyzer. So the analyzer operates on both sides of the search application.

The text analysis process is very complicated when you dive into it. The finer points about stemmers, stop words, synonyms and normalization are often best left to academics in the field of natural language processing. At the same time, we need a basic understanding, which is what we will gain here.

A common first step for developers is to build a test application in Apache Solr or Elasticsearch. Both sit on top of the Lucene library of field processing classes, which is written in Java, so knowledge of Lucene carries over to both Solr and Elasticsearch.

It isn't until after the developer fully tests the features of the Solr and Lucene combination that a production-quality application starts to take shape. Much of that time is spent within text analyzers and the two child processes within analyzers, called tokenizers and filters.

Apache Solr Reference

1. About Field Type Analyzers

As background, each field in Solr is assigned a fieldType, and each fieldType has its own analyzer. So each analyzer is set up specifically for the fieldType, whether that be numeric, boolean or text. In normal usage, the analyzers you define yourself belong to text field types such as solr.TextField.

The analyzer can be split into an "index" section for indexing and a "query" section for the search phase. In both cases the analyzer turns a stream of text into tokens. Beginners are advised to use the same set of analyzer steps for the "index" and "query" processes and to add customization later, after adequate testing.
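
To illustrate the split, here is a minimal sketch of a field type with separate "index" and "query" analyzers. The field type name text_split and the synonyms.txt file are assumptions made for illustration; the tokenizer and filter classes ship with Solr.

```xml
<!-- Hypothetical field type: "text_split" and synonyms.txt are illustrative names -->
<fieldType name="text_split" class="solr.TextField" positionIncrementGap="100">
  <!-- Applied when documents are indexed -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Applied to the user's search terms -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

When an analyzer tag carries no type attribute, the same chain is used for both phases, which is the simpler starting point suggested above.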

The analyzer points to a class. The class may be named in the <analyzer> tag itself, or, if the <analyzer> tag is opened without a class, its child tags identify the tokenizer and filters. In that case, the analyzer is a multi-part process.

2. Syntax for Analyzers

The following syntax is an example of a one-line fieldType analyzer located in either schema.xml or managed-schema.

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
</fieldType>

Below is an example of a multi-line analyzer, including tokenizers and filters.

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Note that single-line analyzer tags end with /> and multi-line analyzers are closed with a separate </analyzer> tag.

3. Options for Analyzers and Delegated Tokenizers and Filters

The following is a list of common analyzers by name. See the table in section 4 for the associated class name. Notice that an analyzer may do all of its work in a single step or delegate tokenizing and filtering to other child classes. See our pages on tokenizers and filters for more information.

Analyzer Name Tokenizer, Filter(s)
Classic Analyzer
  • Classic Tokenizer
  • Classic Filter
  • Lowercase Filter
  • Stop Filter
Keyword Analyzer
  • Creates a single token from the entire stream of text
  • Well suited for id fields and structured data
Simple Analyzer
  • Letter Tokenizer
  • Lowercase Filter
Standard Analyzer
  • Standard Tokenizer
  • Standard Filter
  • Lowercase Filter
  • Stop Filter
Stop Analyzer
  • Letter Tokenizer
  • Lowercase Filter
  • Stop Filter
Whitespace Analyzer
  • Whitespace Tokenizer

There are many other analyzers for languages besides English.
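
As a sketch of the Keyword Analyzer entry in the list above, the field type below keeps each value as a single token by naming only a tokenizer. The field type name id_exact is an assumption made for illustration; solr.KeywordTokenizerFactory is the tokenizer factory that ships with Solr.

```xml
<!-- Hypothetical field type for id-like values; "id_exact" is an illustrative name -->
<fieldType name="id_exact" class="solr.TextField">
  <analyzer>
    <!-- Emits the entire input as one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Because no filter follows the tokenizer, case and punctuation are preserved, so a query must match the stored value exactly.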

4. Examples of Analyzers

Because most of the analyzers here refer to child processes performed by tokenizers and filters, it is best to toggle between the pages to get a grasp of what each step does. What is shown within double-quotes is the token. The token is added to the index or used to retrieve information from the index at query time.

Example 1 - Classic Analyzer

The Classic Analyzer (Lucene class org.apache.lucene.analysis.standard.ClassicAnalyzer) tokenizes using the Classic Tokenizer and filters using the Classic Filter, Lowercase Filter and Stop Filter.

Input: I've gotten 5 E-mails from joe@example.com.
Output: "i've", "gotten", "5", "e", "mails", "from", "joe@example.com"

Of course results from the Stop Filter will vary depending on the words located in the stopwords.txt file.
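
To make the dependence on stopwords.txt concrete, the steps of the Classic Analyzer can be written out as an explicit chain, which lets you name the stop word file yourself. This is only a sketch: the field type name text_classic is an assumption, and stopwords.txt is assumed to sit in the same conf directory as the schema.

```xml
<!-- Hypothetical field type; "text_classic" is an illustrative name -->
<fieldType name="text_classic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <!-- Strips possessive 's and the dots from acronyms -->
    <filter class="solr.ClassicFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Drops any token listed in stopwords.txt -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Editing stopwords.txt and reloading the core changes which tokens are dropped at query time, but documents indexed earlier keep their old tokens until they are reindexed.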

Example 2 - Keyword Analyzer

The Keyword Analyzer (Lucene class org.apache.lucene.analysis.core.KeywordAnalyzer) tokenizes the whole text stream as one term.

Input: Washington DC
Output: "Washington DC"

Example 3 - Simple Analyzer

The Simple Analyzer (Lucene class org.apache.lucene.analysis.core.SimpleAnalyzer) delegates to the Letter Tokenizer and the Lowercase Filter.

Input: The MLB is a long-lived monopoly. The most recent challenge went to the U.S. Supreme Court in 1972.
Output: "the", "mlb", "is", "a", "long", "lived", "monopoly", "the", "most", "recent", "challenge", "went", "to", "the", "u", "s", "supreme", "court", "in"

The number 1972 is dropped because the Letter Tokenizer keeps only letters.

Example 4 - Standard Analyzer

The Standard Analyzer (Lucene class org.apache.lucene.analysis.standard.StandardAnalyzer) uses the Standard Tokenizer and three filters: Standard Filter, Lowercase Filter and Stop Filter. For this example, assume the following words sit in the stopwords.txt file: "they", "it", "or", "the".

Input: They call it N.Y.C., NY, New York, Number-1 City or the Big Apple.
Output: "call", "n.y.c", "ny", "new", "york", "number", "1", "city", "big", "apple"

Apache Solr documentation indicates that the Standard Filter no longer operates when the luceneMatchVersion setting in the solrconfig.xml is higher than 3.1.

Example 5 - Stop Analyzer

The Stop Analyzer (Lucene class org.apache.lucene.analysis.core.StopAnalyzer) uses the Letter Tokenizer, Lowercase Filter and the Stop Filter. For this example, assume the following words sit in the stopwords.txt file: "the", "a".

Input: The green fox jumped over a log. Wait, a green fox?
Output: "green", "fox", "jumped", "over", "log", "wait", "green", "fox"

Example 6 - Whitespace Analyzer

The Whitespace Analyzer (Lucene class org.apache.lucene.analysis.core.WhitespaceAnalyzer) uses the Whitespace Tokenizer.

Input: Wouldn't you want to break this 1-sentence up differintly?
Output: "Wouldn't", "you", "want", "to", "break", "this", "1-sentence", "up", "differintly?"

The misspelled word in the last example, plus the unnecessary symbols created as tokens, should clarify the point that the analysis process is not, and will never be, perfect. Over time and with experience, the developer will weigh the costs of added layers of analysis against the benefits of added accuracy. This is really where the text analysis process begins to take shape.
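
As one example of an added layer, the Whitespace Tokenizer can be followed by filters that clean up its tokens. The sketch below assumes a hypothetical field type name, text_ws_clean, and uses a regular expression chosen only for illustration. It strips trailing punctuation, so "differintly?" becomes "differintly", but it does nothing about the misspelling itself.

```xml
<!-- Hypothetical field type; "text_ws_clean" is an illustrative name -->
<fieldType name="text_ws_clean" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Remove punctuation at the end of a token; the pattern is an illustrative choice -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="\p{Punct}+$" replacement="" replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Each extra filter adds processing at index and query time, which is the cost to weigh against the cleaner tokens it produces.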





 
 