With a roadmap for this tutorial set, I want to take a minute to cover
how we got here. First, we installed Solr and started a Solr server
instance in Standalone mode on this Linux server as a
bin/solr status report indicates.
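As a reminder, that status check is a one-liner from the installation directory:

```shell
# Report the status of any locally running Solr instances
# (run from the solr-7.0.0 installation directory):
bin/solr status
```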
After that, we analyzed our first dataset, which comes with Apache Solr 7: a structured list of 5 fields on 1,100 films. We created a core called films, and Solr automatically copied over default configuration files. A review in the Solr Admin User Interface shows that the core has no documents and no fields. After this tutorial it will, and we will be able to hit our goal: run our first query and see how many films were directed by Spike Lee.
Most of the commands you see here will work across Linux and macOS machines, and I will note where differences for Windows exist. The concepts and directory locations apply regardless of operating system.
Our first case is about as basic as it gets intentionally, because Solr can be difficult for beginners. After we run through the concepts in this test environment we will know enough to be dangerous. Then we can set our sights on building a custom search box on a website using unstructured data in a production environment. That will be fun.
Solr Schema - Configure and Post Files to an Apache Solr Core (15:29)
Videos can also be accessed from our Apache Solr Search Playlist on YouTube (opens in a new browser window).
Moving on to Step 1, our basic case offers us the luxury of focusing only on the structure called a core, instead of a collection. Each has a different directory structure, and a core's is quite simple.
Before we get going with configuration files let me be very clear about the two modes in Solr. The location of files will depend on whether you are using Standalone mode or SolrCloud mode.
The key distinction to me is whether you will split the index across multiple machines or multiple server instances on the same machine. If you do that, you would use SolrCloud mode.
So for us, we have one core named films in Standalone mode. Later, when we discuss SolrCloud mode we will have a separate discussion about configuration file locations, because they differ.
With that, let's detail the structure of 5 configuration files
that apply in Standalone mode. All directories
sit within the installation directory, which in our case is
solr-7.0.0. To give you an
indication of the length and complexity of each file, I will include
the number of lines in each one.
server/solr/solr.xml - server instance configurations (53-line xml file).
server/solr/films/core.properties - core configurations such as names, locations and files in the core (4-line text file).
server/solr/films/conf/solrconfig.xml - core configurations for field guessing, directories, query settings, spell checking, keyword highlighting and query response formats (1,387-line xml file).
server/solr/films/conf/managed-schema - core configurations for field processing managed with two Solr tools (943-line xml file).
server/solr/films/conf/schema.xml - core configurations for field processing managed by hand.
The two schema-related configuration files are mutually exclusive: you will have one or the other. So instead of 5 files, think of 4. In our base case here, when we set up the films core, Solr selected the managed-schema file instead of schema.xml. This requires that we use the Solr Admin User Interface or the Solr Schema API from the command line to manage the file instead of hand-editing the schema, which helps prevent mistakes. We will revisit this in Step 3.
One last point to keep in the back of your mind is that when you move the Solr server to production the directory locations will differ.
For Step 2, now that we have covered configuration file locations, we can focus on the schema itself, named either managed-schema or schema.xml.
A schema is an xml file that tells Solr how to ingest documents into the core, process them, and produce an index that we hope is usable for our audience. In our films case we are using structured data with fields and values, much like a database, so telling Solr how to process documents is vital.
With the schema we set rules around how to process punctuation, capitalized words, email addresses and field types like text or numeric. The schema also controls how Solr creates new fields when ingesting new documents. In addition, Solr has a feature called Copy Fields that creates new fields by combining other fields together, which we will cover in Step 3.
Let's take a look at the first 60 lines of our
managed-schema xml file.
Notice how much of it is devoted to comments, about 80% in my
estimation. So when I mentioned that the file is 943 lines, only about
200 relate to actual settings, so don't get scared off.
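If you would like to browse those opening lines yourself, a quick way to do it from the command line (assuming you are in the solr-7.0.0 installation directory):

```shell
# Show the first 60 lines of the managed schema for the films core:
head -60 server/solr/films/conf/managed-schema
```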
As noted at the top of the file, this is an example schema recommended as a starting point. Take five minutes to browse through it; even though much of it will not make sense yet, it is interesting to see how many languages are covered, including Basque, Persian and Greek.
Now, what is "schemaless" configuration? The term refers to a very basic, default schema file that will create fields when we send it documents. It was designed for speed, not accuracy, so it can make bad guesses about document structure from time to time.
Here's an example. What if you had a 25-year-old law firm and wanted an index that would help you find documentation about specific cases or names of individuals? Imagine thousands of files in formats like Microsoft Word, pdf, text files and spreadsheets. Solr would do its best to organize all that, but the results probably would not be suitable straight out of the box.
So again "schemaless" configuration is not built for production, but it offers a way to get going so you can analyze the index yourself, see where you can improve the search results and modify the schema accordingly. That is what we will do with our second data set later in this tutorial series, which to me is the fun part.
Now for Step 3, we need to make two edits to this "schemaless" configuration file so it works with our films dataset.
Let's visualize the Apache-provided example data we analyzed in the last tutorial.
We are looking at a subset of 3 of the 1,100 films. Each film has five fields: id, name, directed_by, date and a list of one or more genres the film is classified under.
|name|id|directed_by|date|genre|
|.45|/en/45_2006|Gary Lennon|2006-11-30|Black comedy, Thriller|
|25th Hour|/en/25th_hour|Spike Lee|2002-12-16|Crime Fiction, Drama|
|Bamboozled|/en/bamboozled|Spike Lee|2000-10-06|Satire, Indie film, Music|
This is all explained in the Apache Solr Tutorial documentation, but let me summarize our two issues, then we will modify the schema.
We will make the first modification with the Solr Admin UI in a browser and the second with the Solr Schema API at the command line.
First off, it would be nice to do everything straight from the Solr Admin UI in a browser of course, but not all functionality from the command line is available there. Also, while a browser is easier for the new Solr user, it requires a person pointing and clicking, so yes, it is manual.
From the Solr Admin UI, click on the Schema tab and then the Add Field button. Enter name for the field name and text_general for the field type. We want this field to be stored, but not multivalued. Then hit Add Field, and we are ready to head over to the command line for the second modification.
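For reference, the same change can also be made programmatically. The equivalent Schema API call from the Apache Solr 7 Tutorial looks like this, assuming Solr is running at the default localhost:8983 address:

```shell
# Define the name field as stored, single-valued text_general
# via the Schema API (assumes Solr is running on localhost:8983):
curl -X POST -H 'Content-type:application/json' --data-binary \
'{"add-field": {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' \
http://localhost:8983/solr/films/schema
```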
One of the benefits of using the Solr Schema API is that it is programmatic, meaning you can write programs to automate changes. The downside, as I have found with other command line programs, is that people don't really learn what they're doing. Instead, they just copy someone else's code and hope it works.
That said, since we don't have time to fully explain the
curl command, I will summarize it by
saying it offers a way to communicate with servers through a variety
of protocols and specifically here we are using HTTP.
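Based on Exercise 2 of the Apache Solr 7 Tutorial, the command in question looks like this, with <hostname> and <port> as placeholders:

```shell
# Create a Copy Field that copies every source field (*) into a
# destination field named _text_ (replace <hostname> and <port>
# with your server's address, e.g. localhost and 8983):
curl -X POST -H 'Content-type:application/json' --data-binary \
'{"add-copy-field": {"source":"*", "dest":"_text_"}}' \
http://<hostname>:<port>/solr/films/schema
```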
Again, this line comes straight from Exercise 2 of the Apache Solr 7 Tutorial if you have questions. Also note, it should be entered on one line, where <hostname> is localhost or the IP address of the server, and <port> is the port Solr is listening on.
To interpret, this is creating a new Copy Field from the source, or all fields (*), and copying that to a new destination field called _text_. This is all communicated through the Schema API endpoint at the address specified.
Okay, with that we should be good to go on modifications to the
managed-schema file. We should be
ready to post a document and test it out.
In Step 4, we are ready to create the index, and for that we will use Solr's post tool. As covered in the previous tutorial, the data files are kept in the example/films directory right off the installation directory.
All three files (films.csv, films.json and films.xml) have the same data, so it doesn't matter which one you select. That said, there is one nuance about the way the data is structured in csv format that makes it a bit tricky, so let's select the xml format for now. The post command is for Linux and macOS systems; for Windows, I suggest reading the Apache documentation because its post tool is different.
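The indexing command itself, as given in the Apache Solr 7 Tutorial, is run from the installation directory:

```shell
# Post the xml version of the films data to the films core
# (run from the solr-7.0.0 installation directory):
bin/post -c films example/films/films.xml
```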
After entering that, Solr returns a confirmation message showing completion and now we get to shift over to the Solr Admin UI.
In Step 5, we will run two quick queries and return to querying later.
Head over to the Solr Admin UI and click on the Query tab. From there, navigate to the bottom and click Execute Query. Very good. If yours worked properly, it will show the first 10 records in the index in json format. To make it look pretty, you could grab this output and present it in an html document, for example.
And to answer our original question from the previous tutorial, navigate to the q field box, type Spike Lee, and click the Execute Query button. There are the two films directed by Spike Lee. Excellent.
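If you prefer the command line, the same query can be issued over HTTP. A minimal sketch, assuming Solr is running at the default localhost:8983 address:

```shell
# Query the films core for "Spike Lee" through the select handler
# (the quotes are URL-encoded as %22):
curl "http://localhost:8983/solr/films/select?q=%22Spike+Lee%22"
```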
I suggest playing around with search during your free time. Also, click on the Analysis tab to learn a bit about the fields that were imported.
I will leave my server as is for now, and in the next few tutorials we will focus on query functionality and field analysis. After we get comfortable with these aspects using our structured films dataset, we will build another one with unstructured data from a website crawl. That will require that we iterate through indexing, field analysis, schema design, search, and modifying each step until we are comfortable with the outcome. So stay tuned for that.
With that you now know about configuration file locations, the two schema files, and how to edit them at the command line and with the Solr Admin User Interface. We also posted documents to the core and ran our very first query.
As you can see there are many aspects to creating a useful search application with Apache Solr. If you need any help please reach out to me.
Q: If I set up an index, can I change from a
managed-schema to schema.xml later, or is it permanent?
A: Yes, you can change this, but I would suggest sticking with the managed-schema xml file at least until you have a solid understanding of field analysis and schema design.
Join other highly qualified professionals like yourself at our growing YouTube Channel. Subscribe here.