Solr Filters Syntax and Examples | Lucene and Solr Reference

Using a Solr Filter for Text Analysis

Beginner

Building a custom search application with Apache Solr is easy to get up and running, but fine-tuning it can be a challenge. Solr ships with many classes written to analyze text for a multitude of use cases, and filters do much of the work in getting an enterprise search application ready for deployment. In the end, the goal of any natural language processing application is to help users quickly find what they are looking for.

With the Apache Lucene engine sitting below front-end tools like Apache Solr and Elasticsearch, getting comfortable with Lucene indexing will help you make your enterprise search application more useful. It can also enhance your career.

The list of Solr filters below is not exhaustive, so consult the Solr documentation for the full list. Here the focus is on the steps a beginner should take in a test environment, with help on how to move to production.

Apache Solr Reference

1. About Field Type Filters

During both document indexing and query processing, Lucene and Solr analyze streams of text and turn them into tokens. At index time the tokens are stored in the index, and at query time they are matched against it.

Filters perform four types of tasks.

Normalization - removes accents and similar character markings.
Stop words - removes common words like "a", "an", "the", "and", "in", "on" which may not add value in an index.
Synonym expansion - adds synonyms at the same token position as the original term, using an external synonyms file for fine-tuning.
Stemming - replaces words with their stems. For example, the stem "play" applies to words like "plays", "played", and "playing".

The filter is the third step of the analysis process. In the XML-formatted schema it is part of an analysis chain, sitting inside the analyzer tag and after the tokenizer tag. Once the tokenizer passes the token stream to the filter, the filter's class is invoked with any required arguments, processes the token stream and, in the end, hopefully makes the enterprise search experience better for users.

Since it is common to have multiple filters that each hand off a stream of tokens for further processing, you should consider two points. First, the order matters. Second, you should put the more general filters near the start.
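As a rough illustration of why order matters, the sketch below lowercases tokens first, so the stop filter and the stemmer that follow only ever see lowercase input. The particular choice of tokenizer and filters is an assumption made for illustration, not a recommended production chain.

```xml
<!-- Sketch only: the tokenizer and filters chosen here are assumptions for illustration. -->
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- General normalization first: every later filter sees lowercase tokens. -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Tokens are already lowercase, so a stop list of lowercase words matches them. -->
  <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
  <!-- Stemming last, after stop words have been removed in their original form. -->
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
```

Swapping the first two filters would let a capitalized stop word such as "The" slip past a case-sensitive stop list.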

2. Syntax for Filters

Below is an example of a multi-line text analyzer, including tokenizers and filters. It sits in the schema file named either schema.xml or managed-schema.
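A minimal sketch of such a field type follows. The field type name text_example is a hypothetical placeholder, and the choice of the Standard Tokenizer and of a stopwords.txt file in the configuration directory are assumptions.

```xml
<!-- Sketch only: "text_example" is a hypothetical field type name. -->
<fieldType name="text_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split the incoming text into tokens (tokenizer choice is an assumption). -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Filter 1: drop tokens listed in stopwords.txt, ignoring case. -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!-- Filter 2: lowercase whatever tokens remain. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```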

This two-step filter chain removes stop words and then turns all uppercase text to lowercase. Note that single-line filter tags end with />.

3. Options for Filters and Their Actions

Below is a list of common filters by name and class name. Use this list in conjunction with analyzers and tokenizers.

Name and Class	Filtering Actions
Classic Filter `ClassicFilterFactory`	Removes possessive "'s"; removes periods in acronyms. Intended to follow the Classic Tokenizer, which itself discards sentence-ending periods.
English Minimal Stem Filter `EnglishMinimalStemFilterFactory`	Removes plural "s" to create singular form
English Possessive Filter `EnglishPossessiveFilterFactory`	Removes possessive "'s"
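Hyphenated Words Filter `HyphenatedWordsFilterFactory`	Rejoins a word that was split by a hyphen at a line break or other whitespace: a token ending in a hyphen is joined to the next token and the hyphen is dropped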
Keep Word Filter `KeepWordFilterFactory`	Retains only the tokens listed in the file named by the argument words="keepwords.txt". The argument ignoreCase="true/false" defaults to "false", which treats case as significant; "true" ignores case and assumes all keepwords are lowercased.
KStem Filter `KStemFilterFactory`	Applies the KStem stemming algorithm for words in English; less aggressive than the Porter stemmer
Lowercase Filter `LowerCaseFilterFactory`	Changes all uppercase letters to lowercase
Porter Stem Filter `PorterStemFilterFactory`	Applies the Porter stemming algorithm for words in English; more aggressive than the KStem stemmer
Stop Filter `StopFilterFactory`	Removes the tokens listed in the file named by the argument words="stopwords.txt". The argument ignoreCase="true/false" defaults to "false", which treats case as significant; "true" ignores case and assumes all stopwords are lowercased.

See the official Apache Solr and Lucene documentation for other available filters.

4. Examples of Filters

Because the tokenizer step that runs before the filter is not specified here, several examples below show full sentences rather than individual tokens.

Example 1 - Classic Filter

The ClassicFilterFactory class removes possessives and periods in acronyms. It is meant to follow the Classic Tokenizer, which is the step that discards the period at the end of a sentence.

Input	Output
The U.S. Postal Services' workers simply know how to deliver.	The US Postal Services workers simply know how to deliver
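A minimal sketch of an analyzer for this example, assuming the Classic Tokenizer as the preceding step, since the Classic Filter is designed to work on its output:

```xml
<!-- Sketch only: belongs inside a fieldType element in the schema. -->
<analyzer>
  <!-- The Classic Tokenizer splits the text and discards sentence punctuation. -->
  <tokenizer class="solr.ClassicTokenizerFactory"/>
  <!-- The Classic Filter strips "'s" from possessives and periods from acronyms. -->
  <filter class="solr.ClassicFilterFactory"/>
</analyzer>
```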

Example 2 - English Minimal Stem Filter

The EnglishMinimalStemFilterFactory class removes plurals in English.

Input	Output
Building a search engine requires lots of work.	Building a search engine require lot of work.

Example 3 - English Possessive Filter

The EnglishPossessiveFilterFactory class removes possessives in English.

Input	Output
Solr's analyzer is used for indexing and at query time.	Solr analyzer is used for indexing and at query time.
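Examples 2 and 3 could be combined in a single chain. The sketch below assumes the Standard Tokenizer, which keeps a token such as "Solr's" intact so that the possessive filter can see the apostrophe.

```xml
<!-- Sketch only: the tokenizer choice and filter pairing are assumptions for illustration. -->
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Strip the possessive "'s" first, e.g. "Solr's" becomes "Solr". -->
  <filter class="solr.EnglishPossessiveFilterFactory"/>
  <!-- Then reduce plurals to their singular form, e.g. "lots" becomes "lot". -->
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
</analyzer>
```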

Example 4 - Hyphenated Words Filter

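The HyphenatedWordsFilterFactory class rejoins words that were split by a hyphen at a line break or other whitespace. When a token ends with a hyphen, the filter joins it to the following token and drops the hyphen. For this to work, the upstream tokenizer must leave trailing hyphens in place. A word with an internal hyphen and no whitespace, such as "two-thirds", does not produce a token ending in a hyphen, so this filter does not alter it.

Input	Output
Solr rebuilds a hyphen- ated word at index time.	Solr rebuilds a hyphenated word at index time.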
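A minimal sketch of an analyzer using this filter. It assumes the Whitespace Tokenizer, because the filter needs a tokenizer that leaves a trailing hyphen attached to its token.

```xml
<!-- Sketch only: belongs inside a fieldType element; generally useful at index time. -->
<analyzer type="index">
  <!-- Splitting on whitespace keeps a trailing hyphen on the token, e.g. "hyphen-". -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Joins a token ending in a hyphen with the token that follows it. -->
  <filter class="solr.HyphenatedWordsFilterFactory"/>
</analyzer>
```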

Example 5 - Keep Word Filter

The KeepWordFilterFactory class uses an external file of words that will be kept in the index, so it is very restrictive. Here let's assume the file keepwords.txt includes the words: I, am, happy, glad, cheerful, excited, jolly, delighted, joyous.

Input	Output
I am far too jaded to go to work and pretend to be happy	I am happy
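A sketch of the matching configuration, assuming the Standard Tokenizer and a keepwords.txt file stored alongside the schema. Case is left significant here so that the capitalized "I" in the file matches the token "I".

```xml
<!-- Sketch only: tokenizer choice and file location are assumptions. -->
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Discard every token that is not listed in keepwords.txt (case-sensitive). -->
  <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="false"/>
</analyzer>
```

With ignoreCase="true", the entries in keepwords.txt would need to be lowercase.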

Example 6 - KStem Filter

The KStemFilterFactory class is a less aggressive stemmer than the Porter stemmer.

Input	Output
"bump", "bumped", "bumping"	"bump", "bump", "bump"

Example 7 - Lowercase Filter

The LowerCaseFilterFactory class changes all uppercase characters to lowercase.

Input	Output
These stocks have ridiculous PE ratios: Apple, Facebook, Google and Amazon	These stocks have ridiculous pe ratios: apple, facebook, google and amazon

Example 8 - Porter Stem Filter

The PorterStemFilterFactory class uses the Porter stemming algorithm, which is more aggressive than the KStem alternative.
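A minimal sketch of an analyzer that applies this stemmer. Placing a lowercase filter ahead of it is an assumption made here because the stemmers are designed to work on lowercase input.

```xml
<!-- Sketch only: belongs inside a fieldType element in the schema. -->
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Lowercase first so the stemmer receives lowercase tokens. -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- For the less aggressive option, use solr.KStemFilterFactory here instead. -->
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
```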

Input	Output
"thump", "thumped", "thumping"	"thump", "thump", "thump"

Example 9 - Stop Filter

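The StopFilterFactory class uses an external file, here called stopwords.txt. For this example let's assume the file contains the following words: the, an, a, at, and, in, on, out. Let's also assume ignoreCase="true", so that the capitalized "In" and "Out" match the lowercase entries in the file.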

Input	Output
I love to eat at In and Out Burger	I love to eat Burger
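A sketch of a configuration that would produce this result. It assumes ignoreCase="true", because "In" and "Out" are capitalized in the input while the stop list is lowercase; with the default of "false" those two tokens would be kept.

```xml
<!-- Sketch only: tokenizer choice and ignoreCase setting are assumptions. -->
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Remove tokens listed in stopwords.txt, matching regardless of case. -->
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
```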

As the last example shows, filters are not perfect. We are using a computer to attempt natural language processing, and errors do occur. With practice and tuning, accuracy may improve, but at a cost in processing time. Text typed into a search box must be analyzed too, so every additional filter adds time to each query. Consider this before adding layer upon layer of text analysis.
