FactorPad
Faster Learning Tutorials

Solr Filters : syntax, options and examples

A filter in Solr fine-tunes token streams to clean up messy input and improve text analysis.
  1. About - Understand the purpose of a filter, including token normalization, stop word removal, synonym expansion and word stemming.
  2. Syntax - See how filters are coded in schema.xml or managed-schema.
  3. Options - View common classes with descriptions of their purpose.
  4. Examples - Review examples of commonly-used filters.
by Paul Alan Davis, CFA, November 12, 2017
Updated: July 16, 2018
A filter is the third element of the analyzer chain, and there can be several filters strung together. Let's see how it works.



Using a Solr Filter for Text Analysis

Beginner

Getting a custom search application up and running with Apache Solr is easy, but fine-tuning it can be a challenge. Many classes have been written to analyze text for a multitude of use cases, and filters do much of the work in getting an enterprise search application ready for deployment. In the end, the goal of any natural language processing application is to help users quickly find what they are looking for.

With the Apache Lucene engine sitting beneath search platforms like Apache Solr and Elasticsearch, getting comfortable with Lucene indexing will help you make your enterprise search application more useful. It can also enhance your career.

The list of Solr filters below is not exhaustive, so access the Solr documentation for a full list. Here our focus is on steps the beginner should take in a test environment with help on how to move to production.

Apache Solr Reference

1. About Field Type Filters

During both the document indexing and search query steps, Lucene and Solr analyze streams of text and turn them into tokens. Tokens produced at index time are stored in the index, and tokens produced at query time are matched against them.

Filters perform four types of tasks.

  1. Normalization - removes accents and similar character markings.
  2. Stop words - removes common words like "a", "an", "the", "and", "in", "on" which may not add value in an index.
  3. Synonym expansion - adds synonyms at the same position in the document as the original term using an external synonyms file for fine tuning.
  4. Stemming - replaces words with their stems. For example, the stem play applies to words like plays, played, and playing.
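
As a rough sketch, the four tasks above might map onto a field type like the one below. The field type name text_four_tasks and the file names stopwords.txt and synonyms.txt are assumptions for illustration; the filter classes are standard Solr factories. Synonyms are applied only at query time here, which keeps the index-time chain simple.

```xml
<!-- Hypothetical field type; the name and the .txt file names are assumptions -->
<fieldType name="text_four_tasks" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- 1. Normalization: fold accented characters to ASCII, then lowercase -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- 2. Stop words: drop tokens listed in stopwords.txt -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!-- 4. Stemming: reduce words to their stems -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!-- 3. Synonym expansion: runs before stemming so unstemmed entries in synonyms.txt match -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```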

The filter is the third step of the analysis process. In the XML-formatted schema it sits inside the analyzer element, after the tokenizer tag. Once the tokenizer passes the token stream to the filter, the filter's class processes the stream using any arguments supplied and, in the end, hopefully makes the enterprise search experience better for users.

Since it is common for multiple filters to each hand off a stream of tokens for further processing, you should consider two points. First, the order matters. Second, you should put the more general filters near the start.
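
To illustrate why order matters, here is a minimal sketch that assumes a stopwords.txt containing only lowercase words. Lowercasing runs first, so the stop filter sees "The" as "the". The stemmer runs last, so it only processes tokens that survive the earlier filters. The field type name text_ordered is hypothetical.

```xml
<!-- Hypothetical field type showing a general-to-specific filter order -->
<fieldType name="text_ordered" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- General: normalize case for every token -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Narrower: drop stop words (ignoreCase is left at its default of false) -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <!-- Most specific: stem whatever remains -->
    <filter class="solr.KStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Swapping the first two filters would let capitalized stop words such as "The" slip through, unless ignoreCase="true" is added to the stop filter.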

2. Syntax for Filters

Below is an example of a multi-line text analyzer with one tokenizer and two filters. It sits in the schema file named either schema.xml or managed-schema.

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

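This two-filter chain removes the stop words listed in stopwords.txt, then turns all uppercase text to lowercase. Because ignoreCase="true" is set, the stop filter matches words regardless of case even though it runs before lowercasing. Note that single-line filter tags are self-closing and end with />.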

3. Options for Filters and Their Actions

Below is a list of common filters by name and class name. Use this list in conjunction with analyzers and tokenizers.

Each entry gives the filter name, its class name, and its filtering actions.
Classic Filter
ClassicFilterFactory
  • Removes possessive "'s"
  • Removes periods in acronyms
  • Intended for use on the output of the Classic Tokenizer
English Minimal Stem Filter
EnglishMinimalStemFilterFactory
  • Removes plural "s" to create singular form
English Possessive Filter
EnglishPossessiveFilterFactory
  • Removes possessive "'s"
Hyphenated Words Filter
HyphenatedWordsFilterFactory
  • Rejoins words that were split by a hyphen at a line break or other whitespace
Keep Word Filter
KeepWordFilterFactory
  • Retains only the tokens that are listed in a file identified with the argument words="keepwords.txt"
  • The argument ignoreCase="true/false" controls case sensitivity. The default "false" matches case exactly; "true" ignores case and assumes all keepwords in the file are lowercased.
KStem Filter
KStemFilterFactory
  • Applies the KStem stemming algorithm for words in English
  • Less aggressive than the Porter stemmer
Lowercase Filter
LowerCaseFilterFactory
  • Changes all uppercase letters to lowercase
Porter Stem Filter
PorterStemFilterFactory
  • Applies the Porter stemming algorithm for words in English
  • More aggressive than the KStem stemmer
Stop Filter
StopFilterFactory
  • Removes tokens listed in a file identified with the argument words="stopwords.txt"
  • The argument ignoreCase="true/false" controls case sensitivity. The default "false" matches case exactly; "true" ignores case and assumes all stopwords in the file are lowercased.

See the official Apache Solr and Lucene documentation for other available filters.

4. Examples of Filters

Because we aren't specifying the tokenizer step that precedes the filter, several examples below show full sentences rather than individual tokens.

Example 1 - Classic Filter

The ClassicFilterFactory class removes the possessive "'s" and the periods in acronyms. It is designed to work on the output of the Classic Tokenizer, which itself drops the period at the end of a sentence.

Input: The U.S. Postal Service's workers simply know how to deliver.
Output: The US Postal Service workers simply know how to deliver
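
A minimal sketch of how this filter is typically paired with the Classic Tokenizer. The field type name text_classic is an assumption; both class names are standard Solr factories.

```xml
<!-- Hypothetical field type pairing the Classic Tokenizer with the Classic Filter -->
<fieldType name="text_classic" class="solr.TextField">
  <analyzer>
    <!-- The tokenizer splits the text and discards sentence punctuation -->
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <!-- The filter strips periods from acronyms and "'s" from possessives -->
    <filter class="solr.ClassicFilterFactory"/>
  </analyzer>
</fieldType>
```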
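
Example 2 - English Minimal Stem Filter

The EnglishMinimalStemFilterFactory class removes plurals in English. It strips the trailing "s" without regard to part of speech, so the verb "requires" is shortened as well.

Input: Building a search engine requires lots of work.
Output: Building a search engine require lot of work.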

Example 3 - English Possessive Filter

The EnglishPossessiveFilterFactory class removes possessives in English.

Input: Solr's analyzer is used for indexing and at query time.
Output: Solr analyzer is used for indexing and at query time.
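
Examples 2 and 3 can be combined in one chain. In this minimal sketch, which assumes a hypothetical field type named text_en_light, the possessive filter runs first so that the stemmer works on the bare word rather than on a token ending in "'s".

```xml
<!-- Hypothetical light English analysis chain -->
<fieldType name="text_en_light" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- "solr's" becomes "solr" -->
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <!-- "lots" becomes "lot" -->
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>
```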
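
Example 4 - Hyphenated Words Filter

The HyphenatedWordsFilterFactory class rejoins a word that was split in two by a hyphen followed by a line break or other whitespace. A token ending in a hyphen is joined with the next token and the hyphen is discarded. Hyphenated words with no break, like "ex-wife", are left alone.

Input: I've written about two- thirds of an E- mail to my ex-wife.
Output: I've written about twothirds of an Email to my ex-wife.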
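
This filter only sees a trailing hyphen if the tokenizer leaves it attached to the token. A minimal sketch, assuming a hypothetical field type named text_hyphenated and using the Whitespace Tokenizer, which splits only on whitespace:

```xml
<!-- Hypothetical field type; the whitespace tokenizer keeps "two-" as a token -->
<fieldType name="text_hyphenated" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Joins a token ending in "-" with the token that follows it -->
    <filter class="solr.HyphenatedWordsFilterFactory"/>
  </analyzer>
</fieldType>
```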

Example 5 - KeepWord Filter

The KeepWordFilterFactory class uses an external file of words that will be kept in the index and discards every other token, so it is very restrictive. Here let's assume the file keepwords.txt includes the words: I, am, happy, glad, cheerful, excited, jolly, delighted, joyous.

Input: I am far too jaded to go to work and pretend to be happy
Output: I am happy
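
A minimal sketch of the configuration behind this example. The field type name text_keep is an assumption; keepwords.txt is the file described above. With ignoreCase left at "false", the capitalized entry "I" in the file matches the token "I" exactly.

```xml
<!-- Hypothetical field type that keeps only the tokens listed in keepwords.txt -->
<fieldType name="text_keep" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Case-sensitive match against the word list -->
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="false"/>
  </analyzer>
</fieldType>
```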

Example 6 - KStem Filter

The KStemFilterFactory class is a less aggressive stemmer than the Porter stemmer.

Input: "bump", "bumped", "bumping"
Output: "bump", "bump", "bump"

Example 7 - Lowercase Filter

The LowerCaseFilterFactory class changes all uppercase characters to lowercase.

Input: These stocks have ridiculous PE ratios: Apple, Facebook, Google and Amazon
Output: These stocks have ridiculous pe ratios: apple, facebook, google and amazon
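
The Lowercase Filter usually sits ahead of a stemmer in a chain, because the Lucene KStem and Porter stem filters expect input that is already lowercased. A minimal sketch, assuming a hypothetical field type named text_stemmed:

```xml
<!-- Hypothetical field type: lowercase first, then stem -->
<fieldType name="text_stemmed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Swap in solr.KStemFilterFactory for less aggressive stemming -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```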

Example 8 - Porter Stem Filter

The PorterStemFilterFactory class uses the Porter stemmer algorithm, which is more aggressive than the KStem alternative.

Input: "thump", "thumped", "thumping"
Output: "thump", "thump", "thump"

Example 9 - Stop Filter

The StopFilterFactory class uses an external file, here called stopwords.txt. For this example let's assume the following words sit in that file: the, an, a, at, and, in, on, out. Let's also assume ignoreCase="true", so the capitalized "In" and "Out" match the lowercase entries.

Input: I love to eat at In and Out Burger
Output: I love to eat Burger
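
A minimal sketch of the configuration this example assumes. The field type name text_stop is hypothetical; stopwords.txt holds the eight words listed above.

```xml
<!-- Hypothetical field type that removes the tokens listed in stopwords.txt -->
<fieldType name="text_stop" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- ignoreCase="true" lets "In" and "Out" match "in" and "out" -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

With ignoreCase left at its default of "false", "In" and "Out" would survive and only "at" and "and" would be removed.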

So as you can tell, filters are not perfect. Example 9, for instance, stripped most of a restaurant's name. Remember we are using a computer to attempt natural language processing, and errors do occur. With practice and tuning the accuracy may improve, but at a cost in processing time. Text typed into a search box must be analyzed at query time, so every additional filter takes more time. You should consider this before adding layers upon layers of text analysis.





 
 