In our last episode, we incorporated learnings from previous tutorials. We went from building a core to posting a document to the index, and we left off with the analysis step. There we touched on how the Lucene and Solr combination of programs takes a sentence, analyzes it, breaks it into tokens and filters out unnecessary characters.
To me, this is when things get fascinating as we approach the academic aspects of language analysis. This is the first part of the Solr documentation I read before deciding to invest my time in learning Solr. My background is in quantitative equity portfolio management, where the focus is on numbers, and here we are focused on words, but there are similarities. Algorithms sort through information, cleaning up data and allocating weights to an eventual decision. With search, documents are scored based upon what a user enters in a search box. We may not get into the nitty gritty here, but we'll start to see some of the fascinating complexity.
And for backdrop on the series, we are working in a test environment here, keeping things simple, not focusing on more advanced topics like distributed search or building out a production environment just yet. Instead, we are working with a core called solrhelp and seeing how to tell Solr, or more accurately the engine behind Solr, Lucene, to make documents searchable. Here we will cover how it breaks up the lines of text and in the next video we will talk about the fields and field types themselves, including their classes and properties.
We also talked about changes coming in 2018, where website search and enterprise search are taking on greater importance. With offerings like the Google Search Appliance and Google Site Search coming to an end, the window that is closing for many creates a window of opportunity for others who learn text analysis tools like Apache Solr and Elasticsearch.
At the same time, Google did raise the bar. Users expect search results to match what they're used to on Google.com, and that is no easy task. Think about the teams of developers around the globe building systems to analyze languages. Plus, the way Google distributes and caches data translates to incredible search query speed. This is not an inexpensive undertaking, which helps explain their success and why customers expect a lot from a search application.
Of course we can't match that, but with a little work in text analysis here, we can get pretty close. And most of us aren't trying to compete with Google anyway; instead, we just want an effective tool for website search.
So with that as a backdrop, let's have some fun with text analysis in Apache Solr Search.
Solr Analyzer - Text Analysis with Lucene Analyzers, Tokenizers and Filters (15:37)
Videos can also be accessed from our Apache Solr Search Playlist on YouTube (opens in a new browser window).
Okay, for Step 1, let's look at where we left off. We indexed one document from a web crawl. To clarify, this is our second data set. In the first we used structured data like you might find in a database, and here we are using unstructured data like text on a website. We also used a "Schemaless" configuration, which is Solr's way of letting beginners get up and going without having to mess with schema files.
The advantage is speed and simplicity. The disadvantage is that often Solr will create fields you don't actually need, leading to a little index bloat and slowing the search application down. Here we will see how more advanced users customize the schema.
In the last tutorial, we kicked off the topic of Fields
and FieldTypes, which are set up in the file called
managed-schema which is the schema
that Solr automatically created during the
bin/solr create_core step and modified
by Solr automatically during the
bin/post step. Now we want to explore
In total, Solr created 17 Fields and 63 fieldTypes, and no, we won't go through all of them. The point is that each Field is assigned a fieldType, and each fieldType has its own field analyzer, which can be broken up into one or many sequential steps.
And we don't really need to focus on numbers or booleans because they are easily interpreted: 9 or 10, and yes or no, respectively. So these fieldTypes only require one line in the schema and hardly any analysis at all.
However, with the focus on text analysis here, we have a more complex
process because text, whether analyzed during the indexing phase, or
the user's search query phase, can include many different languages,
capitalizations, punctuations, plurals, synonyms, word stems and even
typos. And all of these are programmed by grabbing a class of code,
with the pointers that sit in the
Let's look at a more complicated fieldType,
one with six steps, in the dropdown, called text_en.
Let's see how this one analyzer works during the indexing phase, rather
than the query phase, by typing a sentence and seeing which tokens Solr
creates in the index. Typing
is because Apples' coders ate apples. Uncheck the
Verbose Output box and hit the
Analyze Values button and here we can see a table
listing the six sequential steps in the text analysis process.
Down the left column are abbreviations for the 1 tokenizer and 5 filters.
A couple of points here are important. First, notice how the process works as a chain: each step does its work and then hands off the stream of tokens to the next step for more processing.
Second, this is the point I mentioned about why beginners typically go with Solr "Schemaless" mode. Going beyond that to build a search application for a production environment will require that you learn each step, and frankly a lot of people just don't have the time or budget for that.
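To make the chain idea concrete, here is a minimal Python sketch of one tokenizer handing tokens to a series of filters. This is not Solr or Lucene code. The stop list, the sample sentence and the function names are all assumptions made up for illustration, and the "stemmer" is a deliberately naive stand-in that only strips a trailing "s".

```python
import re

# Assumed, tiny stop list for illustration only; Solr reads its list from a file.
STOP_WORDS = {"a", "an", "is", "it", "the"}


def tokenize(text):
    """Split text into word tokens, keeping apostrophes inside a word."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text)


def stop_filter(tokens):
    """Drop stop words, ignoring case."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def lowercase_filter(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def possessive_filter(tokens):
    """Remove a trailing 's from each token."""
    return [t[:-2] if t.endswith("'s") else t for t in tokens]


def naive_stem_filter(tokens):
    """Very rough stand-in for a stemmer: strip one trailing 's'."""
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def analyze(text, tokenizer, filters):
    """Run the tokenizer, then pass the token stream through each filter in order."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


CHAIN = [stop_filter, lowercase_filter, possessive_filter, naive_stem_filter]

if __name__ == "__main__":
    print(analyze("It is because Apple's coders ate apples.", tokenize, CHAIN))
```

With this hypothetical sentence the sketch ends up with the tokens because, apple, coder, ate and apple. The real analyzer in Solr will differ in its details, so treat the Admin UI table as the source of truth.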
Okay, with that let's take a step back for a moment so we are clear before examining the XML files.
In Step 2, let's get to know analyzers, tokenizers and filters a bit better.
Filters perform four types of tasks.
For Step 3, now that we understand the analyzer is the parent that delegates work to tokenizers and filters, let's see how it is structured.
In our first example, we start with a one-line analyzer.
In this example the analyzer sits within the tags for
the fieldType named text_general. In
this case it is a one-line analyzer, so it is closed off with the
/ at the end because this is XML.
Second, the StandardAnalyzerFactory class has built
into it a tokenizer and several filters. So, the point is, you can use
a preconfigured analyzer without having to customize the tokenizers
and filters that sit below them.
Next, let's look at an analyzer parent with multiple children that can be customized.
Okay, let's make a few observations here. Notice how the analyzer tag
here doesn't point to a class. Instead it delegates to three
children, one tokenizer and two filters. It also doesn't have that
/ and is instead closed with
a separate analyzer tag.
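If it helps to see that parent and child structure from a program's point of view, here is a small Python sketch that parses a fieldType block and lists what the analyzer delegates to. The XML inside the string is a hypothetical example I made up to match the description above (one tokenizer, two filters). The name text_demo and the choice of filters are assumptions, not a copy of anything in your managed-schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical fieldType for illustration: an analyzer with no class of its
# own that delegates to one tokenizer and two filters.
FIELD_TYPE_XML = """
<fieldType name="text_demo" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
"""


def describe_analyzer(xml_text):
    """Return (tag, class) pairs for each child of the analyzer, in order."""
    root = ET.fromstring(xml_text.strip())
    analyzer = root.find("analyzer")
    return [(child.tag, child.get("class")) for child in analyzer]


if __name__ == "__main__":
    for tag, class_name in describe_analyzer(FIELD_TYPE_XML):
        print(tag, class_name)
```

The order of the children matters because that is the order in which the token stream is processed.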
Now that we have seen two fairly basic analyzer setups, let's move on and look at a more complex and realistic one.
For Step 4, let's get a real-life example created automatically when we posted one HTML file to the index from the last tutorial. This will give us an idea as to how we could customize the schema in XML.
We are sitting in the installation directory, and in my case this is
solr-7.0.0 and from here we can
managed-schema file that
sits in the configuration directory for the core. We are not going to
edit this now, so in my text editor
vim I will use the
-R flag for readonly mode.
First, on the structure of this file, the first line identifies the
XML format. Second, is a reminder that this file
shouldn't be manually edited. Third is the schema tag and its
version which will have more meaning in the next tutorial. And finally,
to give you an idea as to complexity, there are 509 lines in this
Now, let's focus just on the snippet related to the fieldType named text_en. It extends for 19 lines, from the opening <fieldType> tag, to the closing </fieldType> tag.
A few points to note here. First, where you see name, class and positionIncrementGap within the fieldType tag, so horizontally, these are field type properties which will be the focus of the next tutorial.
Second, this analyzer is broken up into two sections vertically, an "index" section and a "query" section. When we put documents into the index we employ the first set of tokenizers and filters, and when we are querying the index, we use the second set.
Third, because this is a chain of processes where each one hands the token stream off to the next, it is best to use more general filters first and more advanced ones later.
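To illustrate the second point, the separate "index" and "query" sections, here is a rough Python sketch with two different filter chains. It is a toy, not how Lucene stores an index. The synonym table, the two sample documents and every function name are assumptions for illustration. The idea is simply that documents go through one chain on the way in and the user's query goes through another.

```python
# Assumed synonym table for illustration only.
SYNONYMS = {"tv": ["tv", "television"]}


def lowercase(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def expand_synonyms(tokens):
    """Replace each token with its synonym list, if it has one."""
    expanded = []
    for token in tokens:
        expanded.extend(SYNONYMS.get(token, [token]))
    return expanded


INDEX_CHAIN = [lowercase]                   # used when adding documents
QUERY_CHAIN = [lowercase, expand_synonyms]  # used on the search box input


def run_chain(text, filters):
    """Whitespace-tokenize, then apply each filter in order."""
    tokens = text.split()
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


def build_index(docs):
    """Map each indexed token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in run_chain(text, INDEX_CHAIN):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents matching any token from the analyzed query."""
    hits = set()
    for token in run_chain(query, QUERY_CHAIN):
        hits |= index.get(token, set())
    return hits


if __name__ == "__main__":
    docs = {1: "Television repair guide", 2: "Radio repair guide"}
    index = build_index(docs)
    print(search(index, "TV"))
```

Here a search for TV finds the television document even though the word "tv" was never indexed, because the synonym step only runs on the query side.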
And while we are here, do you notice anything that looks familiar?
Yes, these are the same six tokenizers and filters we saw in the Solr
Admin UI earlier. We saw the ST or
Standard Tokenizer. We also saw the
SF or Stop Filter and so on.
The point is to show you where the actual pointers sit, here in the
Okay, so now we are starting to bump into more intermediate topics, and I'm reluctant to dig any deeper into analyzers because I am trying to keep this at a beginner level.
As you move forward you need to decide how you feel comfortable editing
the schema. Recall, you can hand-edit the
schema.xml itself or use the Schema
API to edit the
And remember too that as you head to a production environment, should you manually edit your schema, you will need to turn off the "Schemaless" configuration and its "field guessing" operation. Up to this point, as beginners, we let Solr do the work for us.
There are dozens of different ways to combine analyzers, tokenizers and filters. I suggest thinking about your business needs and from here create some cores, add documents and play around in the Solr Admin UI with different fields. It will be frustrating at first, but stick with it.
So where do we go from here? Well, we need a better understanding of field types and properties, which I will introduce next and then cover in depth in the next tutorial.
Okay, for Step 5 let's bring back that visual of a fieldType block from earlier.
We covered the items in this block vertically. Now, horizontally, where you see name, class, positionIncrementGap and multiValued, these are called properties.
Properties are organized into three groups, which we will cover in the next tutorial.
As you can see, as we start to customize our schema the level of complexity increases. For those who want a quick search application and don't have time to sweat the details, Solr offers its "Schemaless" configuration. And while it may not provide results even close to what users expect from Google.com or Google Site Search, it is a quick way to get up and going.
An alternative is to use a managed search offering or proceed to customize the schema on your own. As we bump into more intermediate topics you can start to plan your course of action depending on your budget and your business needs.
If you need any help please feel free to reach out to me, I'm here to help.
Q: Do you suggest manually editing the schema or
using the Solr Schema API?
A: Like most things, it comes down to personal preference. I generally like to get my hands dirty with the actual configuration files, whether that be in Linux or Vim, because that is how I learn. The downside is that it is manual. If you need to do it programmatically, say if you were building an interface to sit on top of Solr, then for sure you would want to use the API.
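For those leaning toward the programmatic route, here is a hedged Python sketch of what a Schema API request could look like. It only builds the request. The host and port (localhost:8983), the fieldType name text_demo and the helper function names are my assumptions, and the core name solrhelp is the one from this series. Check the Schema API documentation for your Solr version before relying on the exact payload.

```python
import json
import urllib.request

# Assumed local test instance; adjust host, port and core name to your setup.
SOLR_SCHEMA_URL = "http://localhost:8983/solr/solrhelp/schema"


def build_add_field_type_request(name):
    """Build (but do not send) a POST that adds a simple text fieldType."""
    payload = {
        "add-field-type": {
            "name": name,  # hypothetical fieldType name
            "class": "solr.TextField",
            "analyzer": {
                "tokenizer": {"class": "solr.StandardTokenizerFactory"},
                "filters": [{"class": "solr.LowerCaseFilterFactory"}],
            },
        }
    }
    return urllib.request.Request(
        SOLR_SCHEMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_request(request):
    """Send the request to a running Solr instance and return the response text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Example usage against a running test instance (not executed here):
#   print(send_request(build_add_field_type_request("text_demo")))
```

Keeping the build step separate from the send step makes it easy to inspect the JSON before anything touches the schema.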