/ factorpad.com / tech / solr / reference / solr-filters.html
Beginner
Building a custom search application with Apache Solr is easy to get up and running; fine-tuning it, however, can be a challenge. There are many classes of code written to analyze text for a multitude of use cases, and filters do much of the work in getting an enterprise search application ready for deployment. In the end, the goal of any natural language processing application is to help users quickly find what they are looking for.
With the Apache Lucene engine sitting below front-end tools like Apache Solr and Elasticsearch, getting comfortable with Lucene indexing will help you make your enterprise search application more useful. It can also enhance your career.
The list of Solr filters below is not exhaustive, so access the Solr documentation for a full list. Here our focus is on steps the beginner should take in a test environment with help on how to move to production.
During both the document indexing and search query steps, Lucene and Solr tools together analyze streams of data and turn them into tokens, which are stored in the index at indexing time and matched against it at query time.
Filters perform four types of tasks.
The filter is the third step of the analysis process. It is located in the XML-formatted schema as part of an analysis chain and follows the analyzer and tokenizer tags. After the tokenizer passes the token to the filter, the filter accesses a class of code, passes required arguments, processes the token stream and in the end, hopefully, makes the enterprise search experience better for users.
Since it is common for multiple filters to each hand off a stream of tokens for further processing, you should consider two points. First, the order matters; second, you should put the more general filters near the start.
Below is an example of a multi-line text analyzer, including tokenizers and filters. It sits in the schema file named either schema.xml or managed-schema.
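The original listing did not survive on this page, so here is a minimal sketch of such an analyzer. The field type name text_general and the choice of the standard tokenizer are assumptions for illustration; the two filter classes are the stop and lowercase filters described in this reference.

```xml
<!-- Sketch only: the field type name and tokenizer choice are assumptions -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Step 1: the tokenizer splits the text stream into tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Step 2: filters run in the order listed -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```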
This two-step filter chain removes stopwords and turns all uppercase text to lowercase. Note that single-line filter tags end with />.
Below is a list of common filters by name and class name. Use this list in conjunction with analyzers and tokenizers.
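Name | Class | Filtering Actions |
---|---|---|
Classic Filter | ClassicFilterFactory | Removes possessives and periods in acronyms
English Minimal Stem Filter | EnglishMinimalStemFilterFactory | Removes plurals in English
English Possessive Filter | EnglishPossessiveFilterFactory | Removes possessives in English
Hyphenated Words Filter | HyphenatedWordsFilterFactory | Joins the parts of a hyphenated word, dropping the hyphen
Keep Word Filter | KeepWordFilterFactory | Keeps only the words listed in an external file
KStem Filter | KStemFilterFactory | Stems words less aggressively than the Porter stemmer
Lowercase Filter | LowerCaseFilterFactory | Changes all uppercase characters to lowercase
Porter Stem Filter | PorterStemFilterFactory | Stems words with the Porter algorithm
Stop Filter | StopFilterFactory | Removes the words listed in an external stopwords file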
See the official Apache Solr and Lucene documentation for other available filters.
Because we aren't specifying the tokenizer step that runs before the filter, several examples below show full sentences rather than individual tokens.
The ClassicFilterFactory class removes possessives and the periods in acronyms. It is designed to follow the Classic Tokenizer, which is the step that drops the period at the end of a sentence.
Input | Output |
---|---|
The U.S. Postal Services' workers simply know how to deliver. | The US Postal Services workers simply know how to deliver |
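A minimal sketch of how this filter might be wired up, paired with the Classic Tokenizer whose output it expects. The field type name text_classic is hypothetical.

```xml
<!-- Sketch only: the field type name is hypothetical -->
<fieldType name="text_classic" class="solr.TextField">
  <analyzer>
    <!-- The Classic Filter is meant to process Classic Tokenizer output -->
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.ClassicFilterFactory"/>
  </analyzer>
</fieldType>
```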
The EnglishMinimalStemFilterFactory class removes plurals in English.
Input | Output |
---|---|
Building a search engine requires lots of work. | Building a search engine require lot of work. |
The EnglishPossessiveFilterFactory class removes possessives in English.
Input | Output |
---|---|
Solr's analyzer is used for indexing and at query time. | Solr analyzer is used for indexing and at query time. |
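The two English filters above can be combined in one chain. The sketch below assumes the standard tokenizer and a hypothetical field type name, text_en_minimal; the possessive filter is placed first so that the stemmer sees each word without its 's.

```xml
<!-- Sketch only: the field type name and filter order are illustrative -->
<fieldType name="text_en_minimal" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Strip the trailing 's before any other processing -->
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Light stemming: plural forms become singular -->
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>
```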
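The HyphenatedWordsFilterFactory class rejoins hyphenated words that were split into two tokens by a line break or other whitespace after the hyphen. When a token ends with a hyphen, it is joined to the next token and the hyphen is dropped. A hyphenated word that arrives as a single token is left alone.
Input | Output |
---|---|
I've written about two- thirds of an E- mail to my ex- wife. | I've written about twothirds of an Email to my exwife. |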
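A minimal sketch of this filter in a field type, assuming a whitespace tokenizer so that a token ending in a hyphen reaches the filter intact. The field type name text_hyphenated is hypothetical.

```xml
<!-- Sketch only: the field type name is hypothetical -->
<fieldType name="text_hyphenated" class="solr.TextField">
  <analyzer>
    <!-- Splitting on whitespace keeps a trailing hyphen attached to its token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.HyphenatedWordsFilterFactory"/>
  </analyzer>
</fieldType>
```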
The KeepWordFilterFactory class uses an external file of words that will be kept in the index, so it is very restrictive. Here let's assume the file keepwords.txt includes the words: I, am, happy, glad, cheerful, excited, jolly, delighted, joyous.
Input | Output |
---|---|
I am far too jaded to go to work and pretend to be happy | I am happy |
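A minimal sketch of the configuration behind an example like this one. The words attribute points at the keepwords.txt file assumed above; the field type name text_keep, the standard tokenizer, and the ignoreCase setting are assumptions for illustration.

```xml
<!-- Sketch only: the field type name and ignoreCase value are assumptions -->
<fieldType name="text_keep" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Every token not listed in keepwords.txt is discarded -->
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```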
The KStemFilterFactory class is a less aggressive stemmer than the Porter stemmer.
Input | Output |
---|---|
"bump", "bumped", "bumping" | "bump", "bump", "bump" |
The LowerCaseFilterFactory class changes all uppercase characters to lowercase.
Input | Output |
---|---|
These stocks have ridiculous PE ratios: Apple, Facebook, Google and Amazon | These stocks have ridiculous pe ratios: apple, facebook, google and amazon |
The PorterStemFilterFactory class uses the Porter stemmer algorithm, which is more aggressive than the KStem alternative.
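Either stemmer slots into an analysis chain the same way. The sketch below assumes a hypothetical field type name, text_stemmed, and places the lowercase filter first, since both stemmers expect lowercase input. To try the gentler stemmer, swap the last class for solr.KStemFilterFactory.

```xml
<!-- Sketch only: the field type name is hypothetical -->
<fieldType name="text_stemmed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase first: the stemmers expect lowercase tokens -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Replace with solr.KStemFilterFactory for less aggressive stemming -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```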
Input | Output |
---|---|
"thump", "thumped", "thumping" | "thump", "thump", "thump" |
The StopFilterFactory class uses an external file called stopwords.txt. For this example let's assume the following words sit in that file: the, an, a, at, and, in, on, out.
Input | Output |
---|---|
I love to eat at In and Out Burger | I love to eat Burger |
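A minimal sketch of a configuration that would produce a result like the one above. Matching is case-sensitive unless ignoreCase is set, so ignoreCase="true" is assumed here in order for the capitalized "In" and "Out" to match the lowercase entries in the file. The field type name text_stop is hypothetical.

```xml
<!-- Sketch only: the field type name is hypothetical -->
<fieldType name="text_stop" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- ignoreCase="true" lets "In" and "Out" match "in" and "out" -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```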
As you can tell, filters are not perfect. We are using a computer to attempt natural language processing, and errors do occur. With practice and tuning, accuracy may improve, but at a cost in processing time. Text typed into a search box must be analyzed before it can be matched, so every additional filter takes more time. Consider this before adding layers upon layers of text analysis.
FactorPad offers Apache Solr Search content in both tutorials and reference.