In our last Solr tutorial, we discussed analyzers, tokenizers and
filters as we explored the
managed-schema file set up for us by
the "Schemaless" configuration after we posted a single HTML document
to a core. The Solr field-guessing functionality added
fields and fieldTypes
as it built a schema for us. The end result was an index with
unnecessary overlap. A bloated index like this would slow an
application, for example, if we added hundreds or thousands of
documents to it.
Here we are keeping it simple by focusing on a single Solr core instead of the distributed SolrCloud mode. This tutorial series is for those building a test environment mainly to check out how realistic it is to build a custom search application.
As we near the end of this beginner tutorial series, you will need to decide whether you want to keep going or subscribe to a third-party service for your search needs. This tutorial should help you make that decision.
One thing to keep in mind is that as you advance with Solr and Lucene, the volume and quality of public documentation thins out, so having experience with Java becomes more and more important. That said, here we will finish our two-part discussion on schema, after tackling text analysis in the last tutorial.
Of course 2018 presents an opportunity as more and more people are searching for alternatives to the Google Search Appliance and Google Site Search offerings that are sunsetting. This void creates an opportunity for those who learn text mining and data analysis with tools like Apache Solr and Elasticsearch.
We also know that Apache Solr can scale, as it and the engine behind it, Apache Lucene, drive some of the largest websites and ecommerce sites in the world. This provides another reason for learning Solr.
So with that as a backdrop, let's have some fun with fields and fieldTypes in Apache Solr.
Solr Fields - Field and Field Type Properties in Apache Solr (17:32)
Okay, for Step 1, let's talk about schema. Here we will shift our focus
away from that bloated
file I mentioned.
Recall that with the solrhelp core from the last two tutorials, Solr's field-guessing operation created 17 fields and 63 fieldTypes in 509 lines, even without comments. That is a lot of unnecessary configuration.
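If you want to check counts like these on your own core, here is a minimal Python sketch using only the standard library. The tiny SAMPLE_SCHEMA string and the function name count_schema_elements are made up for illustration; in practice you would read the text of your own managed-schema file instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature schema used only for illustration; a real
# managed-schema file is far longer.
SAMPLE_SCHEMA = """
<schema name="example" version="1.6">
  <!-- comments like this one are skipped by the parser -->
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="text_general" class="solr.TextField"/>
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="title" type="text_general"/>
  <field name="url" type="text_general"/>
</schema>
"""


def count_schema_elements(xml_text):
    """Count the <field> and <fieldType> elements declared in a schema."""
    root = ET.fromstring(xml_text)
    return {
        # iter() walks the whole tree, so nesting depth does not matter
        "fields": len(list(root.iter("field"))),
        "fieldTypes": len(list(root.iter("fieldType"))),
    }


print(count_schema_elements(SAMPLE_SCHEMA))
```

Running the same function before and after you trim a schema gives a quick sense of how much configuration you removed.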
Of course the advantage of using a "Schemaless" configuration for beginners is speed and simplicity. The disadvantage is that these unnecessary fields increase the size of the index and slow it down. The whole point is to get developers up and going, but when you head to production there is no shortcut: you will need to customize your schema.
Now let's look at a schema file from the _default
configuration set, or one that hasn't been modified yet. For Solr 7
anyway, it sits in the directory
This is the one, if you recall, that Solr copies over to new cores and
collections when they are created. Opening it with
vim and the
-R option makes it read-only.
I want to make a few points here. First, while there are 943 lines here, or about 400 more than the one set up with field-guessing, the comments were left in. I mention this because it may be a good idea to read through this. On my end, when I made a new core for my application, I made a copy of this file and removed all comments after I read through and understood what each one did.
Second, there is the standard opening line identifying this as an XML file, followed by licensing information and a reminder that this file should only be used as a starting point. Good advice.
Third, as mentioned in the last tutorial, the schema catch-all tag identifies the version number.
As Solr has grown up over the years, the default behaviors of fields and fieldTypes have changed. In this case, version="1.6" will set these defaults. So when we talk about defaults later in the tutorial, this is where they are set.
Now that we have this file open, for Step 2, let's break out three groups of properties and I will show you where they sit.
Okay, let's start with the third one first, fields. Notice that the default schema only has four.
Each field is given a name and is assigned a fieldType, here it is a "string". So you know, the first two are technically called definitions. The rest, so indexed="true", stored="true" and so on, are called field properties which we will cover in Step 5.
As a takeaway here, remember that the focus is on fields. So if you only had these four fields then you would only need 3 fieldTypes, right? So string, plong and text_general.
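To make that takeaway concrete, here is a small Python sketch that lists the fieldTypes a set of fields actually points to. The four field lines are written as I recall them from a Solr 7 _default schema, so treat the exact attributes as an assumption and compare them with your own file. The function name field_types_in_use is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The four default fields as recalled from a Solr 7 _default schema.
# The exact attributes are an assumption; check your own file.
DEFAULT_FIELDS = """
<schema name="default-config" version="1.6">
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
  <field name="_version_" type="plong" indexed="false" stored="false"/>
  <field name="_root_" type="string" indexed="true" stored="false" docValues="false"/>
  <field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
</schema>
"""


def field_types_in_use(xml_text):
    """Return the sorted, de-duplicated fieldType names that fields reference."""
    root = ET.fromstring(xml_text)
    return sorted({field.get("type") for field in root.iter("field")})


print(field_types_in_use(DEFAULT_FIELDS))
```

Any fieldType in your schema that never shows up in a list like this is a candidate for removal, although dynamic fields and copy fields may also reference types and are not covered by this sketch.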
And assuming you shut off field-guessing mode, then the
schema wouldn't adapt automatically when you submit new documents to
it. So if you wanted more fields then you would need
to modify the schema yourself. Changes can be made either with the
Solr Schema API, which edits the managed schema for you,
or manually in basically the same file, but instead named
schema.xml, so you will have to get
comfortable with one procedure or the other.
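For the API route, requests go to the core's /schema endpoint (Solr's Schema API) as JSON commands such as add-field. Below is a minimal Python sketch that only builds such a request. The field name "summary" is hypothetical, the core name solrhelp comes from this series, and the base URL assumes a local Solr on its default port 8983. The helper name build_add_field_request is my own, not part of Solr.

```python
import json
import urllib.request


def build_add_field_request(core, name, field_type,
                            base_url="http://localhost:8983/solr",
                            **properties):
    """Build (but do not send) a Schema API request that adds one field."""
    payload = {"add-field": {"name": name, "type": field_type, **properties}}
    return urllib.request.Request(
        url=f"{base_url}/{core}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "summary" is a hypothetical field name chosen for this example.
request = build_add_field_request(
    "solrhelp", "summary", "text_general", indexed=True, stored=True
)
print(request.full_url)
print(request.data.decode("utf-8"))

# Sending is left commented out so the sketch runs without a Solr server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the JSON before anything touches your schema.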
Now let's focus only on the three fieldTypes just mentioned, starting with the one called string.
Each fieldType is given a name and points to a class in Solr. And again these are called definitions, with this one pointing to StrField for string field. After that, the rest are properties we can customize. These fieldType properties will be covered in Step 4. Let's look at plong now.
It has a similar setup, a name, a class and in this case a docValues property which I will cover in a few minutes. Note that these fieldTypes only require one line in the schema and don't require any analysis. It is text fields that need to be analyzed, as we can see with text_general.
With this multi-line setup for text_general in view, we can see vertically all of the steps that go into how Solr interprets text during the "index" phase and the "query" phase.
Recall, text analysis is a complex process because text may include multiple languages, capitalization, punctuation, plurals, synonyms, word stems and even typos.
Okay, so we covered everything vertically here within this fieldType block during our discussion of analyzers, tokenizers and filters in the last tutorial. And I rather painstakingly trimmed about 90 pages from the official Apache Solr documentation down to 3 web pages in our Solr Reference area to save you time, so please see those for details.
Our focus here is instead horizontal, and our first stop is with the classes, like solr.TextField here.
In Step 3, let's get to know those 21 fieldType classes and for this I suggest reviewing our reference page on Solr Field Types because much of the work is already done for us there.
Looking at this list, let me draw your attention to a few of the most common classes for beginners. We have fields for booleans, currencies, dates, floating point numbers, integers and geospatial classes for storing data associated with maps.
The two near the bottom are relevant for beginners. The
StrField which we saw earlier is
for short strings and will not be analyzed or tokenized. A URL link
may be assigned to this class, for example.
TextField is for text that
will be broken up into single terms and phrases that go into an
inverted index. Phrases are multiple terms that
when put together have their own meaning, like "South America". The
term "South" has a meaning, the term "America" has a meaning and when
put together, the phrase "South America" has another meaning.
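To see the difference between the two classes, here is a toy Python sketch. It is not how Solr is implemented: the "analyzer" is just a whitespace split plus lowercasing, and the function names and sample documents are made up for illustration.

```python
from collections import defaultdict


def str_field_terms(value):
    """StrField-like behavior: the whole value is kept as one unanalyzed term."""
    return [value]


def text_field_terms(value):
    """TextField-like behavior, using a crude stand-in for a real analyzer."""
    return [token.lower() for token in value.split()]


def build_inverted_index(docs, to_terms):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        for term in to_terms(value):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents.
docs = {1: "South America travel", 2: "North America weather"}

text_index = build_inverted_index(docs, text_field_terms)
string_index = build_inverted_index(docs, str_field_terms)

print(text_index["america"])                  # both documents share this term
print("america" in string_index)              # the string index has no such term
print(string_index["South America travel"])   # only the exact full value matches
```

With the text-style index a search for a single word finds both documents, while the string-style index only answers to the exact, complete value, which is why it suits things like URLs and IDs.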
For Step 4, now that we covered the classes, let's discuss their 7
properties. And looking at the
managed-schema file, these sit in the
first line, or the opening fieldType tag.
Again, let's use the Solr Field Type Properties page.
The first two, again name and class, are technically definitions, and they are required. We are talking about fieldTypes here, so the name matches the pointer from the field, right? And class is where we point to one of the 21 Solr classes.
helps you fine-tune phrase matches, like "South America" from earlier.
allows you to turn on phrase queries. Another example is whether
Solr will tokenize the full phrase "New York" in addition to
"New" and "York".
The last three properties are for more advanced uses, so feel free to review them later.
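Phrase matching depends on the position of each token. The Python sketch below is a simplified model, not Solr's actual code, of how leaving a gap in positions between the values of a multi-valued field (the job of Solr's positionIncrementGap property) keeps a phrase from matching across two separate values. The function names and the sample values are made up for illustration.

```python
def index_positions(values, gap):
    """Assign a position to every token across the values of one field.

    Simplified model: tokens inside a value are consecutive, and `gap`
    extra positions are skipped between one value and the next.
    """
    positions = []
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap
        for token in value.lower().split():
            pos += 1
            positions.append((token, pos))
    return positions


def phrase_matches(positions, phrase):
    """Return True when the phrase's tokens sit at consecutive positions.

    Assumes the phrase contains at least one word.
    """
    words = phrase.lower().split()
    lookup = {}
    for token, pos in positions:
        lookup.setdefault(token, set()).add(pos)
    return any(
        all(start + offset in lookup.get(word, set())
            for offset, word in enumerate(words))
        for start in lookup.get(words[0], set())
    )


# Hypothetical multi-valued field holding two separate values.
values = ["Visit New", "York Times"]

print(phrase_matches(index_positions(values, 0), "New York"))
print(phrase_matches(index_positions(values, 100), "New York"))
print(phrase_matches(index_positions(values, 100), "York Times"))
```

With no gap, "New" and "York" look adjacent even though they belong to different values. With a gap of 100 the false match goes away, while a phrase inside a single value still matches.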
Okay, for Step 5 next are the field properties. So in the
managed-schema example we are
talking about what sits in the field tag discussed earlier.
First, starting out with the definitions, here we see that the field named "id" points to the fieldType "string". It will take on the default properties of that fieldType, the defaults from the field, or we can assign our own properties right in the field block here. Like indexed="true", stored="true", and so on.
Let's cover the meanings here as this is likely the most important point in this tutorial. And again, in the Apache Solr documentation these are called "Field Default Properties".
Let's point to the Reference on Solr Field Properties which lists all 19, but I broke out the most important 8 for beginners in a separate table. Most of these are assigned either true or false and the default value is given. I personally like to specify these in the field tag so you don't have to try to memorize the defaults.
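The order in which these settings win can be sketched in a few lines of Python: start from the defaults, let the fieldType override them, and let the field tag override both. This is a simplified model under stated assumptions. It covers only four properties, uses the default values described in this tutorial plus "false" for required, and the field_types dictionary is a hypothetical stand-in for a real schema.

```python
# Defaults for four common properties. These values are an assumption
# to verify against the reference for your Solr version.
SCHEMA_DEFAULTS = {
    "indexed": True,
    "stored": True,
    "required": False,
    "multiValued": False,
}


def effective_properties(field, field_types, defaults=SCHEMA_DEFAULTS):
    """Resolve properties: defaults first, then the fieldType, then the field."""
    resolved = dict(defaults)
    field_type = field_types[field["type"]]
    for source in (field_type, field):
        for key, value in source.items():
            if key in defaults:  # skip definitions such as name, type, class
                resolved[key] = value
    return resolved


# Hypothetical stand-in for the fieldTypes declared in a schema.
field_types = {
    "string": {"class": "solr.StrField"},
    "text_general": {"class": "solr.TextField", "multiValued": True},
}

id_field = {"name": "id", "type": "string", "required": True, "multiValued": False}
title_field = {"name": "title", "type": "text_general"}

print(effective_properties(id_field, field_types))
print(effective_properties(title_field, field_types))
```

The title field sets nothing itself, so it ends up indexed and stored from the defaults and multi-valued from its fieldType. Writing the properties out in the field tag saves you from doing this resolution in your head.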
First, with the
default property you can
populate a field with a value if no other value is set at index time.
Second, the indexed property puts that
field into the inverted index, meaning it can be searched
in queries to retrieve matching documents. So for example, in a web
search application, we want the full-text field, here named
_text_ to be indexed so it can be searched. The
_version_ field, which is like a record number, we
might set as "false" because we likely won't need to search for it.
The default value here is "true".
Third, the stored property, when "true", means the field's value
can be retrieved in queries. So imagine you are indexing an HTML web
page and it goes into _text_. It is unlikely that we
want to retrieve the whole document, so in order to save space we would
want to select "false". The default value here is "true".
Fourth, the required property is commonly used
for ID fields and structured data, like a shopping website where data
may be required. If "true" and the field is not found during indexing,
then Solr will not index the document. For this property, the default
is "false".
Fifth, the multiValued property is
similar to a one-to-many relationship in a database. Two use cases here.
Recall, in an earlier example using the films dataset
we saw that movies were classified under multiple genres. Also, for web
pages, you might have multiple keywords supplied as metadata when the
document is parsed. Here the default is "false".
And sixth, the
docValues property is a little
confusing. It basically provides an additional structure that will be
used to sort, highlight and provide facets, or groupings, during search.
The first really good example I recall seeing a number of years ago
was the travel website kayak.com which provided an easy way for
users to customize flights by selecting the number
of stops, departure times, durations, airports and airlines. I don't
know if they use Solr, but docValues gives you the ability to create
this type of search functionality, saving users a lot of time. A
standard inverted index is not well suited for this and it adds to the
size of the index, but if your search application is mission-critical,
as it was for kayak.com, then it is good to know this functionality
exists.
The 11 other field properties described in the Reference offer additional fine-tuning, so please check those out later.
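To give a feel for why a separate structure helps with facets and sorting, here is a toy Python sketch of a document-to-value column, which is the basic idea behind docValues. It is a conceptual model only, not Solr's on-disk format, and the flight data and function names are made up for illustration.

```python
from collections import Counter

# Hypothetical flight documents, loosely inspired by the travel-site example.
flights = {
    1: {"airline": "Delta", "stops": 0},
    2: {"airline": "United", "stops": 1},
    3: {"airline": "Delta", "stops": 1},
    4: {"airline": "Delta", "stops": 0},
}


def build_doc_values(docs, field):
    """Column-style structure for one field: document id -> value."""
    return {doc_id: doc[field] for doc_id, doc in docs.items()}


def facet_counts(doc_values, matching_ids):
    """Count how many matching documents fall under each value."""
    return Counter(doc_values[doc_id] for doc_id in matching_ids)


def sort_ids(doc_values, matching_ids):
    """Sort matching documents by the field's value, then by id for ties."""
    return sorted(matching_ids, key=lambda doc_id: (doc_values[doc_id], doc_id))


airline_values = build_doc_values(flights, "airline")
stops_values = build_doc_values(flights, "stops")

print(facet_counts(airline_values, [1, 2, 3, 4]))
print(sort_ids(stops_values, [2, 1, 4, 3]))
```

An inverted index answers "which documents contain this term", while facets and sorting need the opposite lookup, "what value does this document have", which is what the column provides.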
And because knowing these field types is so important, let's go through
a little exercise by opening up the solrhelp core and
examining how its fields were set up. Remember we
indexed one unstructured document from an HTML web crawl, and the
managed-schema file is located in the
This is the bloated one I mentioned earlier.
First is the field named "id" and here we can see it was assigned the fieldType "string". Its multiValued property is "false", so each document holds a single id value. It is indexed, so it can be searched. It is required, so without one the document wouldn't make it into the index, and it is stored, so it can be retrieved in queries. I hope this is starting to make sense now, as we really come full-circle here.
Next, we have several fields that were imported from the metadata of the HTML document, so author, description, keywords, title and url. These were all assigned to the fieldType "text_general", and they were not assigned any properties here, so the defaults will hold, so they will be searched and returned in queries. Right?
Let's verify this by looking at query results in the Solr Admin UI. All of these should show up if they are stored, and they do.
What else do we know about this? Well, we also know from the defaults that these are indexed, so they have been analyzed, tokenized and filtered according to the text_general fieldType discussed earlier.
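The split between "can be searched" and "can be returned" is easy to model. Here is a toy Python sketch, not Solr itself, where indexed decides whether a field's terms go into the inverted index and stored decides whether the original value comes back in results. The class name ToyIndex, the schema dictionary and the sample document are all made up for illustration.

```python
class ToyIndex:
    """A tiny model of how indexed and stored affect search and retrieval."""

    def __init__(self, schema):
        self.schema = schema   # field name -> {"indexed": bool, "stored": bool}
        self.inverted = {}     # (field, term) -> set of document ids
        self.stored = {}       # document id -> {field: original value}

    def add(self, doc_id, doc):
        self.stored[doc_id] = {}
        for field, value in doc.items():
            props = self.schema[field]
            if props["indexed"]:
                # crude stand-in for analysis: lowercase and split on whitespace
                for term in value.lower().split():
                    self.inverted.setdefault((field, term), set()).add(doc_id)
            if props["stored"]:
                self.stored[doc_id][field] = value

    def search(self, field, term):
        ids = sorted(self.inverted.get((field, term.lower()), set()))
        return [{"doc_id": doc_id, **self.stored[doc_id]} for doc_id in ids]


# Hypothetical schema: title is searchable and returned, _text_ is searchable only.
schema = {
    "title": {"indexed": True, "stored": True},
    "_text_": {"indexed": True, "stored": False},
}

index = ToyIndex(schema)
index.add(1, {"title": "Solr Help", "_text_": "full page body about schema design"})

print(index.search("_text_", "schema"))
print(index.search("title", "help"))
```

Both searches find the document, but the large _text_ value never appears in the results because it was not stored, which is the space saving described earlier.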
So I hope this is making more sense now. We reviewed schema file settings vertically in the last tutorial and now we hit it horizontally. I try to explain it from multiple angles because understanding what is in the schema is required as you head to production.
I also produce materials in the Reference that paraphrase the official Apache Solr documentation and can be understood by beginners, so please see if they are helpful for you. To illustrate, this video and those three Reference documents save you from reading over 60 pages in the official Solr documentation, which should help you learn faster, which is our whole mission here at FactorPad.
Where do we go next? Well we still need to talk about settings in the
solrconfig.xml file including how to
parse files with the Apache Tika parser. Also we will cover
considerations as you head to production, like installing Solr on a
server instead of a local machine.
And again, as the level of complexity increases you will have to make the decision as to whether you try to build and administer the Solr search application yourself or work with a firm that specializes in it.
If you need any other help or guidance please feel free to reach out to me.
Q: What other aspects of the schema are important?
A: If your application covers languages other than English then please read through the _default schema files for your language. Also, it is important to know about dynamic fields, which we cover in an upcoming tutorial.