Solr Fields - Field and Field Type Properties in Apache Solr

How to Set up Fields and Field Types in Apache Solr Search

Beginner

In our last Solr tutorial, we discussed analyzers, tokenizers and filters as we explored the managed-schema file set up for us by the "Schemaless" configuration after we posted a single HTML document to a core. The Solr field-guessing functionality added fields and fieldTypes as it built a schema for us. The end result was an index with unnecessary overlap. A bloated index like this would slow an application, for example, if we added hundreds or thousands of documents to it.

Here we are keeping it simple by focusing on a single Solr core instead of the distributed SolrCloud mode. This tutorial series is for those building a test environment mainly to check out how realistic it is to build a custom search application.

As we near the end of this beginner tutorial series, you will need to decide whether you want to keep going or subscribe to a third-party service for your search needs. This tutorial should help you make that decision.

One thing to keep in mind is that as you advance with Solr and Lucene, the volume and quality of public documentation thins out, so having experience with Java becomes more and more important. That said, here we will finish our two-part discussion on schema, after tackling text analysis in the last tutorial.

What is the opportunity?

Of course 2018 presents an opportunity as more and more people are searching for alternatives to the Google Search Appliance and Google Site Search offerings that are sunsetting. This void creates an opportunity for those who learn text mining and data analysis with tools like Apache Solr and Elasticsearch.

We also know that Apache Solr can scale as it, and the engine behind it, Apache Lucene, drive some of the largest websites and ecommerce sites in the world. This provides another reason for learning Solr.

So with that as a backdrop, let's have some fun with fields and fieldTypes in Apache Solr.

Apache Solr in Video

Videos can also be accessed from our Apache Solr Search Playlist on YouTube.

Solr Fields - Field and Field Type Properties in Apache Solr (17:32)

For Those Just Starting Out

Step 1 - Review the Managed-Schema from the _default configset

Okay for Step 1, let's talk about schema. Here we will shift our focus from that bloated managed-schema file I mentioned.

The managed-schema after "Schemaless" field guessing

Recall that with the solrhelp core from the last two tutorials, Solr's field-guessing operation created 17 fields and 63 fieldTypes in 509 lines, even without comments. That is a lot of unnecessary configuration.

Of course the advantage of using a "Schemaless" configuration for beginners is speed and simplicity. The disadvantage is that these unnecessary fields increase the size of the index and slow it down. The whole point is to get developers up and going, but when you head to production there is no shortcut: you will need to customize your schema.

The _default managed-schema file

Now let's look at a schema file from the _default configuration set, or one that hasn't been modified yet. For Solr 7 anyway, it sits in the directory server/solr/configsets/_default/conf. This is the one, if you recall, that Solr copies over to new cores and collections when they are created. Opening it with vim and the -R option makes it readonly.

$ pwd
solr-7.0.0
$ vim -R server/solr/configsets/_default/conf/managed-schema

I want to make a few points here. First, while there are 943 lines here, or about 400 more than the one set up with field-guessing, here the comments were left in. I mention this because it may be a good idea to read through this. On my end, when I made a new core for my application, I made a copy of this file and removed all comments after I read through and understood what each one did.

Second is the standard opening line identifying this as an XML file, followed by licensing information and a reminder that this file should only be used as a starting point. Good advice.

<?xml version="1.0" encoding="UTF-8" ?>

Third, as mentioned in the last tutorial, the schema catch-all tag identifies the version number.
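For reference, here is roughly what that tag looks like. This is a sketch reconstructed from a Solr 7 _default file, so treat the name attribute in particular as an assumption and check it against your own copy.

```xml
<!-- Sketch of the root tag; version="1.6" is the value discussed in this tutorial -->
<schema name="default-config" version="1.6">
  <!-- fieldType, field and other definitions sit between these two tags -->
</schema>
```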

As Solr has grown up over the years, the default behaviors of fields and fieldTypes have changed. In this case, version="1.6" will set these defaults. So when we talk about defaults later in the tutorial, this is where they are set.

Step 2 - Discuss 3 Groups of Properties

Now that we have this file open, for Step 2, let's break out three groups of properties and I will show you where they sit.

fieldTypes - The 21 fieldType classes that come with Solr will be covered in Step 3.
fieldType properties - The 7 fieldType properties termed "general properties" in the Apache Solr documentation are covered in Step 4.
field properties - The 19 field properties also termed "field default properties" are covered in Step 5.

View _default fields

Okay, let's start with the third one, fields first. Notice that the default schema only has four.
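Here is a sketch of those four field tags, with comments removed. I am reconstructing them from a Solr 7 _default schema, so the exact attributes may differ slightly in your version; please verify against your own file.

```xml
<!-- The four fields in the _default managed-schema (reconstructed, verify locally) -->
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="_version_" type="plong" indexed="false" stored="false"/>
<field name="_root_" type="string" indexed="true" stored="false" docValues="false"/>
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
```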

Each field is given a name and is assigned a fieldType, here it is a "string". So you know, the first two are technically called definitions. The rest, so indexed="true", stored="true" and so on, are called field properties which we will cover in Step 5.

As a takeaway here, remember that the focus is on fields. So if you only had these four fields then you would only need 3 fieldTypes, right? So string, plong and text_general.

And assuming you shut off field-guessing mode, then the schema wouldn't adapt automatically when you submit new documents to it. So if you wanted more fields then you would need to modify the schema yourself. Changes can be made either through the Solr Schema API, which edits the managed-schema file for you, or manually in basically the same file, but instead named schema.xml, so you will have to get comfortable with one procedure or the other.
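As a minimal sketch of the manual route, adding a field means adding one more field tag alongside the existing ones. The field name "published" is hypothetical, and I am assuming the pdate fieldType is still defined in your schema, as it is in a Solr 7 _default file.

```xml
<!-- Hypothetical hand-added field; assumes a "pdate" fieldType exists in the schema -->
<field name="published" type="pdate" indexed="true" stored="true"/>
```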

View _default fieldTypes

Now let's focus only on the three fieldTypes just mentioned, starting with the one called string.
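This is roughly how the string fieldType appears. It is a sketch reconstructed from a Solr 7 _default schema, so check the properties against your own copy.

```xml
<!-- name and class are the definitions; the remaining attributes are fieldType properties -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true"/>
```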

Each fieldType is given a name and points to a class in Solr. And again these are called definitions, with this one pointing to StrField for string field. After that, the rest are properties we can customize. These fieldType properties will be covered in Step 4. Let's look at plong now.
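A sketch of the plong definition, again reconstructed from a Solr 7 _default schema and worth verifying locally:

```xml
<!-- A one-line numeric fieldType; no analyzer is needed -->
<fieldType name="plong" class="solr.LongPointField" docValues="true"/>
```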

It has a similar setup, a name, a class and in this case a docValues property which I will cover in a few minutes. Note that these fieldTypes only require one line in the schema and don't require any analysis. It is text fields that need to be analyzed, as we can see with text_general.
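Here is a trimmed sketch of the text_general block. I reconstructed it from a Solr 7 _default schema and left out the comments, so the filter list in your file may not match line for line.

```xml
<!-- Opening tag: definitions and fieldType properties (the "horizontal" view) -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <!-- Analysis applied when documents are indexed -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Analysis applied to the user's query text -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```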

With this multi-line setup for text_general in view, we can see vertically all of the steps that go into how Solr interprets text during the "index" phase and the "query" phase.

Recall, text analysis is a complex process because text may include multiple languages, capitalization, punctuation, plurals, synonyms, word stems and even typos.

Okay, so we covered everything vertically here within this fieldType block during our discussion of analyzers, tokenizers and filters in the last tutorial. And I rather painstakingly trimmed about 90 pages from the official Apache Solr documentation down to 3 web pages in our Solr Reference area to save you time, so please see those for details.

Analyzers Reference
Tokenizers Reference
Filters Reference

Our focus here is instead horizontally, and our first stop is with the classes, like solr.TextField here.

Step 3 - Review the Most Common fieldType Classes

In Step 3, let's get to know those 21 fieldType classes and for this I suggest reviewing our reference page on Solr Field Types because much of the work is already done for us there.

Looking at this list, let me draw your attention to a few of the most common classes for beginners. We have fields for booleans, currencies, dates, floating point numbers, integers and geospatial classes for storing data associated with maps.

The two near the bottom are relevant for beginners. The StrField which we saw earlier is for short strings and will not be analyzed or tokenized. A URL link may be assigned to this class, for example.

Next, the TextField is for text that will be broken up into single terms and phrases that go into an inverted index. Phrases are multiple terms that when put together have their own meaning, like "South America". The term "South" has a meaning, the term "America" has a meaning and when put together, the phrase "South America" has another meaning.
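To make the contrast concrete, here is a small sketch with two fields, one for each class. The field names page_url and page_body are hypothetical; string and text_general are the fieldTypes from the _default schema that point to StrField and TextField.

```xml
<!-- Hypothetical fields for illustration only -->
<!-- StrField: kept as one exact value, not analyzed or tokenized -->
<field name="page_url" type="string" indexed="true" stored="true"/>
<!-- TextField: broken into terms by the analyzer chain of text_general -->
<field name="page_body" type="text_general" indexed="true" stored="false"/>
```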

Step 4 - Highlight the Most Important fieldType Properties

For Step 4, now that we covered the classes, let's discuss their 7 properties. And looking at the managed-schema file, these sit in the first line, or the opening fieldType tag.

Again, let's use the Solr Field Type Properties page.

The first two, again, name and class, are technically definitions, and they are required. We are talking about fieldTypes here, so the name matches the pointer from the field, right? And class is where we point to one of the 21 Solr classes.

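The property positionIncrementGap helps you fine-tune phrase matches, like "South America" from earlier. For a multiValued field, it sets the gap in positions that Solr leaves between values, so a phrase query cannot match across the end of one value and the start of the next. The autoGeneratePhraseQueries property applies to text fields. When "true", Solr automatically generates phrase queries for adjacent terms. When "false", terms must be wrapped in double quotes to be treated as a phrase, like "New York".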
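As a rough sketch, both properties sit in the opening fieldType tag. The fieldType name text_phrase is hypothetical, and the values shown are only for illustration, not a recommendation.

```xml
<!-- Hypothetical fieldType showing where the two phrase-related properties go -->
<fieldType name="text_phrase" class="solr.TextField"
           positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```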

The last three properties are for more advanced uses, so feel free to review them later.

Step 5 - Cover the 8 Field Properties Often Used by Beginners

Okay, for Step 5 next are the field properties. So in the managed-schema example we are talking about what sits in the field tag discussed earlier.

First, starting out with the definitions, here we see that the field named "id" points to the fieldType "string". It will take on the properties set on that fieldType, fall back on the defaults for anything that is not set, or we can assign our own properties right in the field tag here, like indexed="true", stored="true", and so on.
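One way to see this layering is the _root_ field, sketched here as I recall it from a Solr 7 _default schema, so please verify it in your copy. The fieldType turns docValues on, and the field tag turns it back off for this one field.

```xml
<!-- The fieldType sets docValues="true" for every field that points to it -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true"/>

<!-- This field overrides that setting in its own tag -->
<field name="_root_" type="string" indexed="true" stored="false" docValues="false"/>
```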

Let's cover the meanings here as this is likely the most important point in this tutorial. And again, in the Apache Solr documentation these are called "Field Default Properties".

The 8 field properties

Let's point to the Reference on Solr Field Properties which lists all 19, but I broke out the most important 8 for beginners in a separate table. Most of these are assigned either true or false and the default value is given. I personally like to specify these in the field tag so you don't have to try to memorize the defaults.

First, with the default property you can populate a field with a value if no other value is set at index time.
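A minimal sketch of the default property follows. The field name "status" and the value "unknown" are hypothetical and only there to show the syntax.

```xml
<!-- Hypothetical field: documents indexed without a status get the value "unknown" -->
<field name="status" type="string" indexed="true" stored="true" default="unknown"/>
```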

Second, the indexed property puts that field into the inverted index, meaning it will be searched in queries to retrieve matching documents. So for example, in a web search application, we want the full-text field, here named _text_ to be indexed so it can be searched. The _version_ field, which is like a record number, we might set as "false" because we likely won't need to search for it. The default value here is "true".

Third, the stored property, when "true", means the field's original value can be retrieved in queries. So imagine you are indexing an HTML web page and it goes into _text_. It is unlikely that we want to retrieve the whole document, so in order to save space we would want to select "false". The default value here is "true".
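These two settings can be seen side by side in the _text_ and _version_ fields. This sketch is reconstructed from a Solr 7 _default schema, so check it against your own file.

```xml
<!-- Searchable, but the original text is not kept for retrieval -->
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
<!-- Neither searched through the inverted index nor stored -->
<field name="_version_" type="plong" indexed="false" stored="false"/>
```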

Fourth, the required property is commonly used for ID fields and structured data, like a shopping website where data may be required. If "true" and a document is missing the field at index time, then Solr will not index the document. For this property, the default is "false".

Fifth, the multiValued property is similar to a one-to-many relationship in a database. Two use cases here. Recall, in an earlier example using the films dataset we saw that movies were classified under multiple genres. Also, for web pages, you might have multiple keywords supplied as metadata when the document is parsed. Here the default is "false".
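Here is a short sketch of the required and multiValued properties together. The field names sku and genre are hypothetical, chosen to match the shopping and films examples.

```xml
<!-- Hypothetical: a document without a sku value would be rejected -->
<field name="sku" type="string" indexed="true" stored="true" required="true"/>
<!-- Hypothetical: one film document can carry several genre values -->
<field name="genre" type="string" indexed="true" stored="true" multiValued="true"/>
```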

And sixth, the docValues property is a little confusing. It basically provides an additional structure that will be used to sort and provide facets, or groupings, during search. The first really good example I recall seeing a number of years ago was the travel website kayak.com which provided an easy way for users to customize flights by selecting the number of stops, departure times, durations, airports and airlines. I don't know if they use Solr, but docValues gives you the ability to create this type of search functionality, saving users a lot of time. A standard inverted index is not well suited for this, and docValues does add to the size of the index, but if your search application is mission-critical, as it was for kayak.com, then it is good to know this functionality exists.
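To illustrate, a flight search like that might define its filterable fields along these lines. The field names are hypothetical, and I am assuming the string and pint fieldTypes from a Solr 7 _default schema are present.

```xml
<!-- Hypothetical facet and sort fields with docValues switched on explicitly -->
<field name="airline" type="string" indexed="true" stored="true" docValues="true"/>
<field name="stops" type="pint" indexed="true" stored="true" docValues="true"/>
```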

The 11 other field properties described in the Reference offer additional fine-tuning, so please check those out later.

Review fields set up in the solrhelp core by "Schemaless" field-guessing

And because knowing these field properties is so important, let's go through a little exercise by opening up the solrhelp core and examining how its fields were set up. Remember we indexed one unstructured document from an HTML web crawl, and the managed-schema file is located in the server/solr/solrhelp/conf directory. This is the bloated one I mentioned earlier.

First is the field named "id" and here we can see it was assigned the fieldType "string". multiValued is "false", so each document holds a single id value. It is indexed, so it can be searched. It is required, so without one the document wouldn't make it into the index and it is stored, so it can be retrieved in queries. I hope this is starting to make sense now, as we really come full circle here.
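For reference, the id field described above looks roughly like this. It is a sketch based on the Solr 7 _default definition, and the attribute order in your solrhelp file may differ.

```xml
<!-- Single-valued, searchable, mandatory and retrievable -->
<field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
```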

Next, we have several fields that were imported from the metadata of the HTML document, so author, description, keywords, title and url. These were all assigned to the fieldType "text_general", and they were not assigned any properties here, so the defaults will hold, so they will be searched and returned in queries. Right?
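A sketch of how those guessed fields would appear follows. I am assuming the bare form that field guessing typically writes, with no properties in the tag, so compare it with your own solrhelp schema.

```xml
<!-- No properties are set here, so the fieldType settings and the defaults apply -->
<field name="author" type="text_general"/>
<field name="description" type="text_general"/>
<field name="keywords" type="text_general"/>
<field name="title" type="text_general"/>
<field name="url" type="text_general"/>
```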

Let's verify this by looking at query results in the Solr Admin UI and all of these should show up if they are stored, and they do.

What else do we know about this? Well, we also know from the defaults that these are indexed, so they have been analyzed, tokenized and filtered according to the text_general fieldType discussed earlier.

Summary

So I hope this is making more sense now. We reviewed schema file settings vertically in the last tutorial and now we hit it horizontally. I try to explain it from multiple angles because understanding what is in the schema is required as you head to production.

I also produce materials in the Reference that paraphrase the official Apache Solr documentation and can be understood by beginners, so please see if they are helpful for you. To illustrate, this video and those three Reference documents save you from reading over 60 pages in the official Solr documentation, which should help you learn faster, which is our whole mission here at FactorPad.

Where do we go next? Well we still need to talk about settings in the solrconfig.xml file including how to parse files with the Apache Tika parser. Also we will cover considerations as you head to production, like installing Solr on a server instead of a local machine.

And again, as the level of complexity increases you will have to make the decision as to whether you try to build and administer the Solr search application yourself or work with a firm that specializes in it.

If you need any other help or guidance please feel free to reach out to me.

Questions and Answers

Q: What other aspects of the schema are important?
A: If your application covers languages other than English then please read through the _default schema files for your language. Also, it is important to know about dynamic fields, which we cover in an upcoming tutorial.