With a roadmap for this tutorial set, I want to take a minute to cover
how we got here. First, we installed Solr and started a Solr server
instance in Standalone mode on this Linux server as a
bin/solr status report indicates.
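As a reminder, that status check is a one-liner from the installation directory:

```shell
# Report the status of any locally running Solr instances
# (run from the solr-7.0.0 installation directory):
bin/solr status
```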
After that, we analyzed our first dataset, which comes with Apache Solr 7: a structured list of 5 fields on 1,100 films. We created a core called films, and Solr automatically copied over default configuration files. A review in the Solr Admin User Interface shows that the core has no documents and no fields. After this tutorial it will, and we will be able to hit our goal: run our first query and see how many films were directed by Spike Lee.
Most of the commands you see here will work across Linux and macOS machines, and I will note where differences for Windows exist. The concepts and directory locations apply regardless of operating system.
Our first case is about as basic as it gets intentionally, because Solr can be difficult for beginners. After we run through the concepts in this test environment we will know enough to be dangerous. Then we can set our sights on building a custom search box on a website using unstructured data in a production environment. That will be fun.
Solr Schema - Configure and Post Files to an Apache Solr Core (15:29)
Videos can also be accessed from our Apache Solr Search Playlist on YouTube (opens in a new browser window).
Moving on to Step 1, our basic case offers us the luxury of focusing only on the structure called a core, instead of a collection. Each has a different directory structure, and a core's is quite simple.
Before we get going with configuration files let me be very clear about the two modes in Solr. The location of files will depend on whether you are using Standalone mode or SolrCloud mode.
The key distinction to me is whether you will split the index across multiple machines or multiple server instances on the same machine. If you do that, you would use SolrCloud mode.
So for us, we have one core named films in Standalone mode. Later, when we discuss SolrCloud mode we will have a separate discussion about configuration file locations, because they differ.
With that, let's detail the structure of 5 configuration files
that apply in Standalone mode. All directories
sit within the installation directory, which in our case is
solr-7.0.0. To give you an
indication of the length and complexity of each file, I will include
the number of lines in each one.
server/solr/solr.xml - server instance configurations (53-line xml file).
server/solr/films/core.properties - core configurations such as names, locations and files in the core (4-line text file).
server/solr/films/conf/solrconfig.xml - core configurations for field guessing, directories, query settings, spell checking, keyword highlighting and query response formats (1,387-line xml file).
server/solr/films/conf/managed-schema - core configurations for field processing managed with two Solr tools (943-line xml file).
server/solr/films/conf/schema.xml - core configurations for field processing managed by hand.
The two schema-related configuration files are mutually exclusive: you will have one or the other. So instead of 5 files, think of 4. In our base case here, when we set up the films core, Solr selected the managed-schema file instead of schema.xml. This requires that we use the Solr Admin User Interface or the Solr Schema API from the command line to manage the file instead of hand-editing the schema, which helps prevent mistakes. We will revisit this in Step 3.
One last point to keep in the back of your mind is that when you move the Solr server to production the directory locations will differ.
For Step 2, now that we have covered configuration file locations, we can focus on the schema itself, named either managed-schema or schema.xml.
A schema is an xml file that tells Solr how to ingest documents into the core, process them, and produce an index that we hope is usable for our audience. In our films case we are using structured data with fields and values, much like a database, so telling Solr how to process documents is vital.
With the schema we set rules around how to process punctuation, capitalized words, email addresses and field types like text or numeric. The schema also controls how Solr creates new fields when ingesting new documents. In addition, Solr has a feature called Copy Fields that creates new fields by combining other fields together, which we will cover in Step 3.
Let's take a look at the first 60 lines of our
managed-schema xml file.
Notice how much of it is devoted to comments, about 80% in my
estimation. So when I mentioned that the file is 943 lines, only about
200 relate to actual settings, so don't get scared off.
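If you would like to browse those opening lines yourself, a quick way to do it from the command line (assuming you are in the solr-7.0.0 installation directory):

```shell
# Show the first 60 lines of the managed schema for the films core:
head -60 server/solr/films/conf/managed-schema
```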
As noted at the top of the file, this is an example schema recommended as a starting point. Take five minutes to browse through it; even though much of it will not make sense yet, it is interesting to see how many languages are covered, including Basque, Persian and Greek.
Now, what is "schemaless" configuration? The term refers to a very basic, default schema file that will create fields when we send it documents. It was designed for speed, not accuracy, so it can make bad guesses about document structure from time to time.
Here's an example. What if you had a 25-year-old law firm and wanted an index that would help you find documentation about specific cases or names of individuals? Imagine thousands of files in formats like Microsoft Word, pdf, text files and spreadsheets. Solr would do its best to organize all that, but the results probably would not be suitable straight out of the box.
So again "schemaless" configuration is not built for production, but it offers a way to get going so you can analyze the index yourself, see where you can improve the search results and modify the schema accordingly. That is what we will do with our second data set later in this tutorial series, which to me is the fun part.
Now for Step 3, we need to make two edits to this "schemaless" configuration file so it works with our films dataset.
Let's visualize the Apache-provided example data we analyzed in the last tutorial.
We are looking at a subset of 3 of the 1,100 films. Each film has five fields: id, name, directed_by, date and a list of one or more genres the film is classified under.
|name|id|directed_by|date|genre|
|.45|/en/45_2006|Gary Lennon|2006-11-30|Black comedy, Thriller|
|25th Hour|/en/25th_hour|Spike Lee|2002-12-16|Crime Fiction, Drama|
|Bamboozled|/en/bamboozled|Spike Lee|2000-10-06|Satire, Indie film, Music|
This is all explained in the Apache Solr Tutorial documentation, but let me summarize our two issues, then we will modify the schema.
We will make the first modification with the Solr Admin UI in a browser and the second with the Solr Schema API at the command line.
First off, it would be nice to do everything straight from the Solr Admin UI in a browser of course, but not all functionality from the command line is available there. Also, while a browser is easier for the new Solr user, it requires a person pointing and clicking, so yes, it is manual.
From the Solr Admin UI, click on the Schema tab and then the Add Field button. Enter name for the field name and text_general for the field type. We want this field to be stored, but not multivalued. Then hit Add Field, and we are ready to head over to the command line for the second modification.
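For reference, the same change can also be made programmatically. The equivalent Schema API call from the Apache Solr 7 Tutorial looks like this, assuming Solr is running at the default localhost:8983 address:

```shell
# Define the name field as stored, single-valued text_general
# via the Schema API (assumes Solr is running on localhost:8983):
curl -X POST -H 'Content-type:application/json' --data-binary \
'{"add-field": {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' \
http://localhost:8983/solr/films/schema
```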
One of the benefits of using the Solr Schema API is that it is programmatic, meaning you can write programs to automate changes. The downside, as I have found with other command line programs, is that people don't really learn what they're doing. Instead, they just copy someone else's code and hope it works.
That said, since we don't have time to fully explain the
curl command, I will summarize it by
saying it offers a way to communicate with servers through a variety
of protocols and specifically here we are using HTTP.
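Based on Exercise 2 of the Apache Solr 7 Tutorial, the command in question looks like this, with <hostname> and <port> as placeholders:

```shell
# Create a Copy Field that copies every source field (*) into a
# destination field named _text_ (replace <hostname> and <port>
# with your server's address, e.g. localhost and 8983):
curl -X POST -H 'Content-type:application/json' --data-binary \
'{"add-copy-field": {"source":"*", "dest":"_text_"}}' \
http://<hostname>:<port>/solr/films/schema
```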
Again, this line comes straight from Exercise 2 of the Apache Solr 7 Tutorial if you have questions. Also note, it should be entered on one line, where <hostname> is localhost or the IP address of the server, and <port> is the port Solr is listening on.
To interpret, this is creating a new Copy Field from the source, or all fields (*), and copying that to a new destination field called _text_. This is all communicated through the Schema API endpoint at the address specified.
Okay, with that we should be good to go on modifications to the
managed-schema file. We should be
ready to post a document and test it out.
In Step 4, we are ready to create the index, and for that we will use Solr's post tool. As covered in the previous tutorial, the data files are kept in the example/films directory right off the installation directory.
All three files (films.csv, films.json and films.xml) have the same data, so it doesn't matter which one you select. That said, there is one nuance about the way the data is structured in csv format that makes it a bit tricky, so let's select the xml format for now. The post command is for Linux and macOS systems; for Windows, I suggest reading the Apache documentation because its post tool is different.
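The indexing command itself, as given in the Apache Solr 7 Tutorial, is run from the installation directory:

```shell
# Post the xml version of the films data to the films core
# (run from the solr-7.0.0 installation directory):
bin/post -c films example/films/films.xml
```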
After entering that, Solr returns a confirmation message showing completion and now we get to shift over to the Solr Admin UI.
In Step 5, we will run two quick queries and return to querying later.
Head over to the Solr Admin UI and click on the Query tab. From there, navigate to the bottom and click Execute Query. Very good. If yours worked properly, it will show the first 10 records in the index in json format. To make it look pretty, you could grab this output and present it in an html document, for example.
And to answer our original question from the previous tutorial, navigate to the q field box, type Spike Lee, and click the Execute Query button. There are the two films directed by Spike Lee. Excellent.
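If you prefer the command line, the same query can be issued over HTTP. A minimal sketch, assuming Solr is running at the default localhost:8983 address:

```shell
# Query the films core for "Spike Lee" through the select handler
# (the quotes are URL-encoded as %22):
curl "http://localhost:8983/solr/films/select?q=%22Spike+Lee%22"
```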
I suggest playing around with search during your free time. Also, click on the Analysis tab to learn a bit about the fields that were imported.
I will leave my server as is for now, and in the next few tutorials we will focus on query functionality and field analysis. After we get comfortable with these aspects using our structured films dataset, we will build another one with unstructured data from a website crawl. That will require that we iterate through indexing, field analysis, schema design, search, and modifying each step until we are comfortable with the outcome. So stay tuned for that.
With that you now know about configuration file locations, the two schema files, and how to edit them at the command line and with the Solr Admin User Interface. We also posted documents to the core and ran our very first query.
As you can see there are many aspects to creating a useful search application with Apache Solr. If you need any help please reach out to me.
Q: If I set up an index, can I change from a
managed-schema to schema.xml later, or is it permanent?
A: Yes, you can change this, but I would suggest sticking with the managed-schema xml file at least until you have a solid understanding of field analysis and schema design.
Join other highly qualified professionals like yourself at our growing YouTube Channel. Subscribe here.