FactorPad
Faster Learning Tutorials

Text Analysis with Solr Analyzers, Tokenizers and Filters

We continue where we left off with text analysis. Here we explore the Solr schema used both at query time and at index time.
  1. Indexing - Review field analysis in the Solr Admin UI after our web crawl posted a document to the index.
  2. Analyzer - Discuss field analyzers, tokenizers and filters.
  3. Structure - Detail the structure of an analyzer in XML.
  4. Example - Look at an example created automatically by our "Schemaless" configuration.
  5. Field Types - Introduce Field Types and Properties.
by Paul Alan Davis, CFA, November 14, 2017
Updated: January 14, 2018
So that is how we will gain a comfort level with the text analysis process. Now let's walk through each step.

Outline Back Next

~/ home  / tech  / solr  / tutorial  / solr analyzer


How Text Analysis is Translated to Apache Solr Indexing and Queries

Beginner

In our last episode, we incorporated learnings from previous tutorials. We went from building a core to posting a document to the index, and we left off with the analysis step. There we touched on how the Lucene-Solr combination of programs takes a sentence, analyzes it, breaks words up into tokens and filters out unnecessary characters.

To me, this is when things get fascinating, as we approach the academic aspects of language analysis. This is the first part of the Solr documentation I read before deciding to invest my time in learning Solr. My background is in quantitative equity portfolio management, where the focus is on numbers; here we are focused on words, but there are similarities. Algorithms sort through information, cleaning up data and allocating weights toward an eventual decision. With search, documents are scored based upon what a user enters in a search box. We may not get into the nitty-gritty here, but we'll start to see some of the fascinating complexity.

And for backdrop on the series, we are working in a test environment here, keeping things simple, not focusing on more advanced topics like distributed search or building out a production environment just yet. Instead, we are working with a core called solrhelp and seeing how to tell Solr, or more accurately the engine behind Solr, Lucene, to make documents searchable. Here we will cover how it breaks up the lines of text and in the next video we will talk about the fields and field types themselves, including their classes and properties.

What is the opportunity?

We also talked about changes coming in 2018, as website search and enterprise search take on greater importance. With offerings like the Google Search Appliance and Google Site Search coming to an end, the window closing for many creates a window of opportunity for others who learn text analysis tools like Apache Solr and Elasticsearch.

At the same time, Google did raise the bar. Users expect search results to match what they're used to on Google.com, and that is no easy task. Think about the teams of developers around the globe building systems to analyze languages. Plus, the way Google distributes and caches data translates to incredible search query speed. This is not an inexpensive undertaking, which explains both Google's success and why customers expect a lot from a search application.

Of course we can't match that, but with a little work in text analysis here, we can get pretty close. And most of us aren't trying to compete with Google anyway, instead we just want an effective tool for website search.

So with that as a backdrop, let's have some fun with text analysis in Apache Solr Search.

Apache Solr in Video

Solr Analyzer - Text Analysis with Lucene Analyzers, Tokenizers and Filters (15:37)

Videos can also be accessed from our Apache Solr Search Playlist on YouTube (opens in a new browser window).

For Those Just Starting Out

Step 1 - Pick Up Where We Left Off in the Solr Admin UI

Okay for Step 1, let's look at where we left off. We indexed one document from a web crawl. To clarify, this is our second data set. In the first we used structured data like you might find in a database, and here we are using unstructured data like text on a website. We also used a "Schemaless" configuration that is Solr's way to allow beginners to get up and going without having to mess with schema files.

The advantage is speed and simplicity. The disadvantage is that often Solr will create fields you don't actually need, leading to a little index bloat and slowing the search application down. Here we will see how more advanced users customize the schema.

Analyze a sentence using the text_en Field Type

In the last tutorial, we kicked off the topic of Fields and FieldTypes, which are set up in a file called managed-schema. This is the schema Solr created automatically during the bin/solr create_core step and then modified during the bin/post step. Now we want to explore those modifications.

In total, Solr created 17 Fields and 63 fieldTypes, and no, we won't go through all of them. The point is that each Field is assigned a fieldType, and each fieldType has its own field analyzer, which can be broken up into one or many sequential steps.

And we don't really need to focus on numbers or booleans because they are easily interpreted: 9 or 10, yes or no. So these fieldTypes only require one line in the schema and hardly any analysis at all.
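To make that concrete, here is what such one-line definitions look like. These two lines follow the stock Solr 7 managed-schema; your version may differ slightly:

```xml
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
<fieldType name="pint" class="solr.IntPointField" docValues="true"/>
```

Each is a single self-closing tag, with no analyzer children at all.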

However, with the focus on text analysis here, we have a more complex process, because text, whether analyzed during the indexing phase or during the user's search query, can include many different languages, capitalizations, punctuation marks, plurals, synonyms, word stems and even typos. Each of these behaviors is programmed by pointing to a class of code, and those pointers sit in the managed-schema or schema.xml file.

Let's look at a more complicated fieldType, one with six steps: select text_en in the dropdown. Let's see how this one analyzer works during the indexing phase, rather than the query phase, by typing a sentence and seeing which tokens Solr creates for the index. Type "Apple's success is because Apples' coders ate apples." then uncheck the Verbose Output box and hit the Analyze Values button. Here we can see a table listing the six sequential steps in the text analysis process.

Step      1        2        3    4        5        6       7    8
Original  Apple's  success  is   because  Apples'  coders  ate  apples.
ST        Apple's  success  is   because  Apples   coders  ate  apples
SF        Apple's  success       because  Apples   coders  ate  apples
LCF       apple's  success       because  apples   coders  ate  apples
EPF       apple    success       because  apples   coders  ate  apples
SKMF      apple    success       because  apples   coders  ate  apples
PSF       appl     success       becaus   appl     coder   at   appl

Down the left column are abbreviations for the one tokenizer and five filters.

  • ST - The StandardTokenizer broke the sentence up into terms, or tokens, and also removed the trailing apostrophe from Apples' and the period at the end of the sentence.
  • SF - The Stop Filter stripped out the term "is" which is also known as a stop word, like "a", "as", "an" and "the". It is up to the designer of the search application of course, but many people don't see value in having these in the index so they are often removed.
  • LCF - The Lower Case Filter changed all of the capitalized letters to lower case.
  • EPF - The English Possessive Filter removed the possessive in the token "apple's".
  • SKMF - The Set Keyword Marker Filter made no modifications here; its job is to protect listed words from being stemmed.
  • PSF - The Porter Stem Filter cut words down to their stems, which in the case of "ate" changing to "at" might not be ideal.
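To see the chain idea in miniature, here is a toy Python sketch of the first four steps: tokenizer, stop filter, lower case filter and possessive filter. These functions are simplified stand-ins, not Lucene's actual classes, and the keyword marker and Porter stemming steps are omitted:

```python
import re

def standard_tokenizer(text):
    # Split on whitespace, strip surrounding punctuation and a trailing apostrophe
    tokens = []
    for raw in text.split():
        tok = raw.strip(".,!?")
        tok = re.sub(r"'$", "", tok)   # Apples' -> Apples
        tokens.append(tok)
    return tokens

def stop_filter(tokens, stopwords={"a", "an", "as", "the", "is"}):
    # Drop stop words entirely
    return [t for t in tokens if t.lower() not in stopwords]

def lower_case_filter(tokens):
    return [t.lower() for t in tokens]

def english_possessive_filter(tokens):
    # apple's -> apple
    return [re.sub(r"'s$", "", t) for t in tokens]

sentence = "Apple's success is because Apples' coders ate apples."
stream = standard_tokenizer(sentence)
for step in (stop_filter, lower_case_filter, english_possessive_filter):
    stream = step(stream)          # each step hands its output to the next
print(stream)
# ['apple', 'success', 'because', 'apples', 'coders', 'ate', 'apples']
```

Notice the output matches the EPF row of the table above; adding a stemmer to the loop would take it the rest of the way to the PSF row.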

A couple of points are important here. The first is simply to see how the process works as a chain: each step does its work, then hands the stream of tokens to the next step for more processing.

Second is the point I mentioned about why beginners typically go with Solr's "Schemaless" mode. Going beyond that to build a search application for a production environment will require that you learn each step, and frankly a lot of people just don't have the time or budget for that.

Okay, with that let's take a step back for a moment so we are clear before examining the XML files.

Step 2 - Field Analyzers, Tokenizers and Filters

In Step 2, let's get to know analyzers, tokenizers and filters a bit better.

  • Analyzer - An analyzer is a parent tag in XML that delegates text processing to tokenizers and filters.
  • Tokenizer - A tokenizer is responsible for breaking up text strings into "tokens", as we saw earlier.
  • Filter - A filter is responsible for cleaning up each token.

Filters perform four types of tasks.

  1. Normalization - removes accents and similar character markings.
  2. Stop words - removes unnecessary words.
  3. Synonym expansion - adds synonyms, for example, if you wanted search results to appear whether the user entered Unix or Linux you could do that.
  4. Stemming - replaces words with their stems. For example, the stem play covers play, played and playing.
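Two of these tasks, synonym expansion and stemming, can be sketched in a few lines of Python. This is a toy illustration, not Lucene's implementation; the synonym table and suffix list are invented for the example:

```python
# Toy synonym expansion: emit every synonym in place of the original token
SYNONYMS = {"unix": ["linux", "unix"], "linux": ["linux", "unix"]}

def synonym_filter(tokens):
    out = []
    for tok in tokens:
        out.extend(SYNONYMS.get(tok, [tok]))
    return out

# Toy stemming: strip a few common English suffixes (a real stemmer,
# like the Porter algorithm, is far more careful than this)
def stem_filter(tokens):
    out = []
    for tok in tokens:
        for suffix in ("ing", "ed", "s"):
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        out.append(tok)
    return out

print(synonym_filter(["install", "unix", "tools"]))
# ['install', 'linux', 'unix', 'tools']
print(stem_filter(["play", "played", "playing"]))
# ['play', 'play', 'play']
```

In real Solr, synonym pairs live in synonyms.txt and are applied by a filter class in the schema, as we will see in the text_en example below.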

Step 3 - Detail the Structure of an Analyzer in XML

For Step 3, now that we understand the analyzer is the parent that delegates work to tokenizers and filters, let's see how it is structured.

One-Line Analyzer

In our first example, we start with a one-line analyzer.

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer class="solr.StandardAnalyzerFactory"/>
</fieldType>

In this example the analyzer sits within the tags for the fieldType named text_general. Because it is a one-line analyzer, it is closed off with the / at the end, as this is XML. Second, the StandardAnalyzerFactory class has a tokenizer and several filters built into it. So, the point is, you can use a preconfigured analyzer without having to customize the tokenizers and filters that would otherwise sit below it.

Multi-Line Analyzer

Next, let's look at an analyzer parent with multiple children that can be customized.

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Okay, let's make a few observations here. Notice how the analyzer tag doesn't point to a class. Instead it delegates to three children: one tokenizer and two filters. It also doesn't have that closing /, and is instead closed with a separate closing analyzer tag.

Now that we have seen two fairly basic analyzer setups, let's move on and look at a more complex and realistic one.

Step 4 - Look at an Example Created with our "Schemaless" Configuration

For Step 4, let's get a real-life example created automatically when we posted one HTML file to the index from the last tutorial. This will give us an idea as to how we could customize the schema in XML.

Review the managed-schema file as it relates to the text_en Field Type

We are sitting in the installation directory, which in my case is solr-7.0.0, and from here we can access the managed-schema file that sits in the configuration directory for the core. We are not going to edit it now, so in my text editor vim I will use the -R flag for read-only mode.

$ vim -R server/solr/solrhelp/conf/managed-schema
<?xml version="1.0" encoding="UTF-8"?>
<!-- Solr managed schema - automatically generated - DO NOT EDIT -->
<schema name="default-config" version="1.6">
  ...
  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>
  ...
</schema>

First, on the structure of this file, the first line identifies the XML format. Second, is a reminder that this file shouldn't be manually edited. Third is the schema tag and its version which will have more meaning in the next tutorial. And finally, to give you an idea as to complexity, there are 509 lines in this managed-schema file.

Now, let's focus just on the snippet related to the fieldType named text_en. It extends for 19 lines, from the opening <fieldType> tag, to the closing </fieldType> tag.

A few points to note here. First, within the fieldType tag, running horizontally, you see name, class and positionIncrementGap; these are field type properties, which will be the focus of the next tutorial.

Second, this analyzer is broken up into two sections vertically, an "index" section and a "query" section. When we put documents into the index we employ the first set of tokenizers and filters, and when we are querying the index, we use the second set.

Third, because this is a chain of processes, each one handing the token stream off to the next, it is best to apply more general filters first and more specialized ones later.

And while we are here, do you notice anything that looks familiar?

Yes, these are the same six steps, one tokenizer and five filters, that we saw in the Solr Admin UI earlier. We saw the ST or Standard Tokenizer. We also saw the SF or Stop Filter, and so on. The point is to show you where the actual pointers sit: here in the managed-schema.

Start practicing with analyzers, tokenizers and filters

Okay, we are now starting to bump into more intermediate topics, and I'm reluctant to dig any deeper into analyzers because I am trying to keep this at a beginner level.

As you move forward you need to decide how you are most comfortable editing the schema. Recall, you can hand-edit the schema.xml itself or use the Schema API to edit the managed-schema file.
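As a rough sketch of the Schema API route, assuming Solr is running on its default port 8983 with our solrhelp core, adding a field programmatically looks something like this (the field name author here is just an illustration):

```shell
# Add a field to the managed-schema of the solrhelp core via the Schema API
curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-field": {"name": "author", "type": "text_en", "stored": true}}' \
  http://localhost:8983/solr/solrhelp/schema
```

Solr applies the change to managed-schema for you, which is why that file carries the DO NOT EDIT warning.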

And remember too that as you head to a production environment, should you manually edit your schema, you will need to turn off the "Schemaless" configuration and its "field guessing" operation. Up to this point, as beginners, we let Solr do the work for us.

There are dozens of ways to combine analyzers, tokenizers and filters. I suggest thinking about your business needs, and from there creating some cores, adding documents and playing around with different fields in the Solr Admin UI. It will be frustrating at first, but stick with it.
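If you prefer the command line for practicing, the Admin UI Analysis screen is backed by an analysis request handler, so you can ask for the same token breakdown over HTTP. A hedged example, assuming the default port 8983 and our solrhelp core:

```shell
# Ask Solr to analyze a value against the text_en field type,
# the same work the Admin UI Analysis screen performs
curl "http://localhost:8983/solr/solrhelp/analysis/field?analysis.fieldtype=text_en&analysis.fieldvalue=Apple%27s%20success"
```

The response shows the token stream after each tokenizer and filter in the chain.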

Also, across three pages in the reference material I pulled out 6 analyzers, 7 tokenizers and 9 filters that are easy to understand.

So where do we go from here? Well, we need a better understanding of field types and properties which I will introduce next and then devote the next tutorial to it.

Step 5 - Introduce Field Types and Properties

Okay, for Step 5 let's bring back that visual of a fieldType block from earlier.

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

We covered the items in this block vertically. Now, horizontally, where you see name, class, positionIncrementGap and multiValued, these are called properties.

Properties are organized into three groups, which we will cover in the next tutorial.

  • General field type properties
  • Class field type properties
  • Default field type properties

Summary

As you can see, as we start to customize our schema the level of complexity increases. For those who want a quick search application and don't have time to sweat the details, Solr offers its "Schemaless" configuration. And while it may not provide results close to what users expect from Google.com or Google Site Search, it is a quick way to get up and going.

An alternative is to use a managed search offering, or to proceed to customize the schema on your own. As we bump into more intermediate topics you can start to plan your course of action depending on your budget and your business needs.

If you need any help please feel free to reach out to me, I'm here to help.


Related Solr Reference Material


Questions and Answers

Q:  Do you suggest manually editing the schema or using the Solr Schema API?
A:  Like everything, it comes down to personal preference. I generally like to get my hands dirty with the actual configuration files, whether that be in Linux or Vim, because that is how I learn. The downside is that it is manual. If you need to do it programmatically, say, if you were building an interface to sit on top of Solr, then for sure you would want to use the API.


What's Next?

If you would like other free opportunities to learn, join our growing FactorPad YouTube Channel. Subscribe here.

  • To see the current list of tutorials, click Outline.
  • To learn about how to perform a web crawl in Solr, click Back.
  • To set up fields and field types, click Next.

Outline Back Next

Keywords:
apache solr
solr search
solr analyzer
apache solr admin ui
solr tokenizer
lucene tokenizer
solr filter
solr managed schema
lucene filter
solr schema
solr stemmer
solr tutorial
website search
google site search
elasticsearch
solr schemaless
text analysis
porter stemmer