FactorPad
Faster Learning Tutorials

Configure an Apache Solr Schema and Post Files to a Core

Here we start with an empty core and end with our first search.
  1. Configuration - Locate configuration files specific to a Solr core in Standalone mode.
  2. Solr schema - Describe a schema and considerations for the films dataset when using a "schemaless" configuration.
  3. Modify schema - Review reasons to modify the films schema and edit managed-schema using two methods.
  4. Post to core - Use the post tool to add a document to the films index.
  5. Search films - Run a query with the Solr Admin UI in a browser.
by Paul Alan Davis, CFA, October 21, 2017
Updated: July 14, 2018
So that is how we will perform our first search. Now let's walk through each step.



Navigate Schema and Configurations in Apache Solr Search

Beginner

With a roadmap for this tutorial set, I want to take a minute to cover how we got here. First, we installed Solr and started a Solr server instance in Standalone mode on this Linux server as a bin/solr status report indicates.

After that, we analyzed our first dataset that comes with Apache Solr 7. It is a structured list of 5 fields on 1,100 films. We created a core called films and Solr automatically copied over default configuration files. A review with the Solr Admin User Interface will show that the core has no documents and no fields. After this tutorial it will, and we will be able to hit our goal, run our first query, and see how many films were directed by Spike Lee.

Most of the commands you see here will work across Linux and macOS machines, and I will note where differences for Windows exist. The concepts and directory locations apply regardless of operating system.

Our first case is intentionally about as basic as it gets, because Solr can be difficult for beginners. After we run through the concepts in this test environment we will know enough to be dangerous. Then we can set our sights on building a custom search box on a website using unstructured data in a production environment. That will be fun.

Apache Solr in Video

Solr Schema - Configure and Post Files to an Apache Solr Core (15:29)

Videos can also be accessed from our Apache Solr Search Playlist on YouTube.

For Those Just Starting Out

Step 1 - Locate Configuration Files Specific to a Solr Core

Moving on to Step 1, our basic case offers us the luxury of focusing only on the structure called a core, instead of a collection. Each has a different directory structure, and the structure of a core is quite simple.

Solr's two modes

Before we get going with configuration files let me be very clear about the two modes in Solr. The location of files will depend on whether you are using Standalone mode or SolrCloud mode.

  • Standalone mode - An index is stored on a single computer and the setup is called a core. There can be multiple cores or indexes here, meaning we could have one core for films and another for music hosted on the same machine.
  • SolrCloud mode - An index is distributed across multiple computers or even multiple server instances on one computer. Groups of documents here are called collections.

The key distinction to me is whether you will split the index across multiple machines or multiple server instances on the same machine. If you do that, you would use SolrCloud mode.

So for us, we have one core named films in Standalone mode. Later, when we discuss SolrCloud mode we will have a separate discussion about configuration file locations, because they differ.

Configuration files in Standalone mode

With that, let's detail the structure of 5 configuration files that apply in Standalone mode. All directories sit within the installation directory, which in our case is solr-7.0.0. To give you an indication of the length and complexity of each file, I will include the number of lines in each one.

  • server/solr/solr.xml - server instance configurations (53 line xml file).
  • server/solr/films/core.properties - core configurations such as names, locations and files in the core (4 line text file).
  • server/solr/films/conf/solrconfig.xml - core configurations for field guessing, directories, query settings, spell checking, keyword highlighting and query response formats (1,387 line xml file).
  • server/solr/films/conf/managed-schema - core configurations for field processing managed with two Solr tools (943 line xml file).
  • server/solr/films/conf/schema.xml - core configurations for field processing managed by hand.

The two schema-related configuration files are mutually exclusive: you will have one or the other. So instead of 5 files, think of 4. In our base case, when we set up the films core, Solr selected the managed-schema file instead of schema.xml. That means we manage the file through the Solr Admin User Interface or the Solr Schema API from the command line rather than editing it by hand, which helps us avoid hand-editing mistakes. We will revisit this in Step 3.
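Since a core has one schema file or the other, a quick check can tell you which one you are dealing with. Here is a minimal sketch in Python, assuming you run it from the Solr installation directory; the helper name schema_file_for is made up for this illustration.

```python
from pathlib import Path


def schema_file_for(core_dir):
    """Return the schema file found in a core's conf directory, or None."""
    conf = Path(core_dir) / "conf"
    # The two schema files are mutually exclusive, so check each in turn.
    for name in ("managed-schema", "schema.xml"):
        candidate = conf / name
        if candidate.is_file():
            return candidate
    return None


# Assumed location: run from the Solr installation directory (solr-7.0.0 here).
found = schema_file_for("server/solr/films")
print(found if found else "no schema file found (is this the installation directory?)")
```

For the films core described above, this should report the managed-schema file.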

One last point to keep in the back of your mind is that when you move the Solr server to production the directory locations will differ.

Step 2 - Describe a Schema and "Schemaless" Configuration

For Step 2, now that we have covered configuration file locations, we can focus on the schema itself, named either managed-schema or schema.xml.

What is a schema?

A schema is an xml file that tells Solr how to ingest documents into the core, process them and spit out an index that we hope is usable for our audience. In our films case we are using structured data with fields and values, much like a database, so telling Solr how to process documents is vital.

With the schema we set rules around how to process punctuation, capitalized words, email addresses and field types like text or numeric. The schema also governs how Solr creates new fields when ingesting new documents. Finally, Solr has a feature called Copy Fields that creates fields combining other fields together, which we will cover in Step 3.

What does a schema file look like?

Let's take a look at the first 60 lines of our managed-schema xml file. Notice how much of it is devoted to comments, about 80% in my estimation. So when I mentioned that the file is 943 lines, only about 200 relate to actual settings, so don't get scared off.

$ head -n 60 server/solr/films/conf/managed-schema
<?xml version="1.0" encoding="UTF-8" ?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<!--

 This example schema is the recommended starting point for users.
 It should be kept correct and concise, usable out-of-the-box.

 For more information, on how to customize this file, please see
 http://lucene.apache.org/solr/guide/documents-fields-and-schema-design.html

 PERFORMANCE NOTE: this schema includes many optional features and should not
 be used for benchmarking.  To improve performance one could
  - set stored="false" for all fields possible (esp large fields) when you
    only need to search on the field but don't need to return the original
    value.
  - set indexed="false" if you don't need to search on the field, but only
    return the field as a result of searching on other indexed fields.
  - remove all unneeded copyField statements
  - for best index size and searching performance, set "index" to false
    for all general text fields, use copyField to copy them to the
    catchall "text" field, and use that for searching.
-->

<schema name="default-config" version="1.6">
    <!-- attribute "name" is the name of this schema and is only used for display purposes.
       version="x.y" is Solr's version number for the schema syntax and
       semantics.  It should not normally be changed by applications.

       1.0: multiValued attribute did not exist, all fields are multiValued
            by nature
       1.1: multiValued attribute introduced, false by default
       1.2: omitTermFreqAndPositions attribute introduced, true by default
            except for text fields.
       1.3: removed optional field compress feature
       1.4: autoGeneratePhraseQueries attribute introduced to drive QueryParser
            behavior when a single string produces multiple tokens.  Defaults
            to off for version >= 1.4
       1.5: omitNorms defaults to true for primitive field types
            (int, float, boolean, string...)
       1.6: useDocValuesAsStored defaults to true.
    -->

As noted at the top of the file, this is an example schema recommended as a starting point. Take five minutes to browse through it. Even though much of it will not make sense yet, it is interesting to see how many languages are covered, including Basque, Persian and Greek.

What is a "Schemaless" configuration?

Now, what is "schemaless" configuration? The term refers to a very basic and default schema file that will create fields when we send it documents. It was designed for speed and not accuracy. So it can make bad guesses about document structure from time to time.

Here's an example. What if you had a 25-year-old law firm and wanted an index that would help you find documentation about specific cases or names of individuals? Imagine thousands of files in formats like Microsoft Word, pdf, text files and spreadsheets. Solr would do its best to organize that, but the results probably would not be suitable straight out of the box.

So again "schemaless" configuration is not built for production, but it offers a way to get going so you can analyze the index yourself, see where you can improve the search results and modify the schema accordingly. That is what we will do with our second data set later in this tutorial series, which to me is the fun part.

Step 3 - Review the Two Ways to Edit the managed-schema File

Now for Step 3, we need to make two edits to this "schemaless" configuration file so it works with our films dataset.

The two required customizations to our "schemaless" configuration

Let's visualize the Apache-provided example data we analyzed in the last tutorial.

We are looking at a subset of 3 of the 1,100 films. Each film has five fields: id, name, directed_by, date and a list of one or more genres the film is classified under.

name        id              directed_by  date        genre
.45         /en/45_2006     Gary Lennon  2006-11-30  Black comedy, Thriller
25th Hour   /en/25th_hour   Spike Lee    2002-12-16  Crime Fiction, Drama
Bamboozled  /en/bamboozled  Spike Lee    2000-10-06  Satire, Indie film, Music

The films data is licensed under the Creative Commons Attribution 2.5 Generic License. View the license at http://creativecommons.org/licenses/by/2.5/

This is all explained in the Apache Solr Tutorial documentation, but let me summarize our two issues, then we will modify the schema.

  1. Assign a text field - The first issue is that when Solr ingests this file it will guess a numeric field type for the name field, because the title of the first film is .45. Later titles that are plain text would then fail to index.
  2. Create a Copy Field - The second issue is that it is helpful in search applications to have a "catchall" field combining data from other fields, so you don't have to specify a field during search. That is why we will create a Copy Field here.
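To see why the first film title causes trouble, here is a toy illustration in Python. It is not Solr's actual field-guessing code; the function guess_type is hypothetical and only mimics the idea of choosing a field type from the first value seen.

```python
def guess_type(value):
    """Toy guesser: call a value numeric if it parses as a number, else text."""
    try:
        float(value)
        return "numeric"
    except ValueError:
        return "text"


titles = [".45", "25th Hour", "Bamboozled"]

# A schemaless setup fixes the field type from the first value it sees.
field_type = guess_type(titles[0])
print("guessed type for name:", field_type)

# Later titles no longer fit the guessed type.
for title in titles[1:]:
    fits = guess_type(title) == field_type
    print(repr(title), "fits" if fits else "does not fit")
```

Defining name as a text field before posting any documents takes the guess out of the picture.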

We will make the first modification with the Solr Admin UI in a browser and the second with the Solr Schema API at the command line.

Edit the schema using the Solr Admin UI

First off, it would be nice to do everything straight from the Solr Admin UI in a browser of course, but not all functionality from the command line is available there. Also, while a browser is easier for the new Solr user, it requires a person pointing and clicking, so yes, it is manual.

From the Solr Admin UI, click on the Schema tab and then the Add Field button. Enter name as the field name and select text_general as the field type. We want it to be stored, but not indexed. Then hit Add Field, and we are ready to head over to the command line for the second modification.
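The same field could also be defined through the Schema API with an add-field command instead of the Admin UI. Here is a sketch in Python of the JSON body such a request might carry. The property values simply mirror the Admin UI choices described above and are my assumption of the equivalent settings. Use one method or the other, because adding a field that already exists returns an error.

```python
import json

# Mirrors the Admin UI choices above: field "name", type text_general,
# stored but not indexed. These values are assumptions for illustration.
add_field_command = {
    "add-field": {
        "name": "name",
        "type": "text_general",
        "stored": True,
        "indexed": False,
    }
}

# This JSON body would be POSTed to the core's /schema endpoint.
body = json.dumps(add_field_command)
print(body)
```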

Edit the schema using the Solr Schema API

One of the benefits of using the Solr Schema API is that it is programmatic, meaning you can write programs to automate changes. The downside, as I have found with other command line programs, is that people don't really learn what they're doing. Instead, they just copy someone else's code and hope it works.

That said, since we don't have time to fully explain the curl command, I will summarize it by saying it offers a way to communicate with servers through a variety of protocols and specifically here we are using HTTP.

Again, this line is straight from the Apache Solr 7 Tutorial for Exercise 2 if you have questions. Also note, this should be entered on one line.

$ curl -X POST -H 'Content-type:application/json' --data-binary '{"add-copy-field" : {"source":"*","dest":"_text_"}}' http://<hostname>:<port>/solr/films/schema

Here <hostname> is localhost or the IP address of the server, and <port> is the port Solr listens on (8983 by default).

To interpret, this creates a Copy Field that takes every field (the * source) and copies it to a destination field called _text_. The request goes to the Schema API endpoint at the address specified.
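Because the Schema API is programmatic, the curl line can be reproduced in a script. Below is a minimal sketch using only Python's standard library. The function names schema_request and send are made up for this example, and localhost with port 8983 is an assumption about your setup. The script builds the request and prints it; nothing is sent until you call send yourself.

```python
import json
import urllib.request


def schema_request(host, port, core, command):
    """Build (but do not send) a Schema API POST, mirroring the curl flags."""
    url = f"http://{host}:{port}/solr/{core}/schema"
    return urllib.request.Request(
        url,
        data=json.dumps(command).encode("utf-8"),      # --data-binary '...'
        headers={"Content-type": "application/json"},  # -H '...'
        method="POST",                                 # -X POST
    )


def send(request):
    """Send the request to a running Solr server and return the response text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Assumed values: localhost and port 8983 (Solr's default).
copy_all = {"add-copy-field": {"source": "*", "dest": "_text_"}}
request = schema_request("localhost", 8983, "films", copy_all)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With a Solr server running, print(send(request)) would submit the change and show the server's reply.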

Okay, with that we should be good to go on modifications to the managed-schema file. We should be ready to post a document and test it out.
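Before posting, it can be reassuring to confirm that the Copy Field landed. The Schema API lists copy fields at the copyfields path under the same schema endpoint. The sketch below checks a response body for our rule; the sample text is illustrative rather than captured from my server, and has_copy_field is a hypothetical helper.

```python
import json

# Illustrative response body, shaped like the Schema API's reply to
# GET http://<hostname>:<port>/solr/films/schema/copyfields
sample_response = """
{
  "responseHeader": {"status": 0, "QTime": 1},
  "copyFields": [{"source": "*", "dest": "_text_"}]
}
"""


def has_copy_field(response_text, source, dest):
    """Return True if the response lists a copy field from source to dest."""
    copy_fields = json.loads(response_text).get("copyFields", [])
    return any(
        cf.get("source") == source and cf.get("dest") == dest
        for cf in copy_fields
    )


print(has_copy_field(sample_response, "*", "_text_"))
```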

Step 4 - Post Documents to the films Core

In Step 4, we are ready to create the index and for that we will use the bin/post command.

Where is the data kept? And in what formats?

As covered in the previous tutorial, the data files are kept in the example/films directory right off the installation directory.

$ ls -og example/films
total 884
-rw-r--r-- 1   3829 Sep  8 12:34 film_data_generator.py
-rw-r--r-- 1 124581 Sep  8 12:34 films.csv
-rw-r--r-- 1 300955 Sep  8 12:34 films.json
-rw-r--r-- 1    299 Sep  8 12:34 films-LICENSE.txt
-rw-r--r-- 1 455444 Sep  8 12:34 films.xml
-rw-r--r-- 1   4986 Sep  8 12:34 README.txt

All three files films.csv, films.json and films.xml have the same data, so it doesn't matter which one you select. That said, there is one nuance about the way the data is structured in csv format that makes it a bit tricky, so let's select the xml format for now. The command below is for Linux and macOS systems. For Windows, I suggest reading Apache documentation because its post tool is different.

$ bin/post -c films example/films/films.xml

After entering that, Solr returns a confirmation message showing completion and now we get to shift over to the Solr Admin UI.
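For a rough idea of what the post tool does on our behalf, it sends the file's contents to the core's update handler over HTTP. The sketch below builds a comparable request for a single film in Solr's XML update format. It is an approximation and not the tool's actual code: build_add_xml is a hypothetical helper, localhost and port 8983 are assumptions, and only a few fields from the table in Step 3 are included. The request is built and printed but not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Serialize one document in Solr's <add><doc> XML update format."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for field, values in doc.items():
        # A multi-valued field is sent as one <field> element per value.
        for value in values if isinstance(values, list) else [values]:
            ET.SubElement(doc_el, "field", name=field).text = value
    return ET.tostring(add, encoding="unicode")


# One film from the table in Step 3; a subset of fields for brevity.
film = {
    "id": "/en/25th_hour",
    "name": "25th Hour",
    "directed_by": "Spike Lee",
    "genre": ["Crime Fiction", "Drama"],
}

# Assumed host and port; commit=true asks Solr to commit after the update.
request = urllib.request.Request(
    "http://localhost:8983/solr/films/update?commit=true",
    data=build_add_xml(film).encode("utf-8"),
    headers={"Content-type": "application/xml"},
    method="POST",
)
print(request.get_method(), request.full_url)
print(build_add_xml(film))
```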

Step 5 - Search the Films Index in the Solr Admin UI

In Step 5, we will run two quick queries and return to querying later.

Head over to the Solr Admin UI and click on the Query tab. From there navigate to the bottom and click Execute Query. Very good. If yours worked properly it will show the first 10 records in the index in json format. To make it look pretty you could grab this output and present it in an html document, for example.
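As a small step toward making that output look pretty, here is a sketch in Python that turns a json response into readable lines. The sample text is illustrative and trimmed to two documents from the table in Step 3, so the exact fields and values on your server may differ; summarize is a hypothetical helper.

```python
import json

# Illustrative response, trimmed to two documents and three fields each.
sample = """
{
  "responseHeader": {"status": 0},
  "response": {
    "numFound": 1100,
    "start": 0,
    "docs": [
      {"id": "/en/45_2006", "name": ".45", "directed_by": ["Gary Lennon"]},
      {"id": "/en/25th_hour", "name": "25th Hour", "directed_by": ["Spike Lee"]}
    ]
  }
}
"""


def summarize(response_text):
    """Return a count line followed by one 'title (directors)' line per film."""
    result = json.loads(response_text)["response"]
    lines = [f"{result['numFound']} documents found"]
    for doc in result["docs"]:
        directors = ", ".join(doc.get("directed_by", []))
        lines.append(f"{doc['name']} ({directors})")
    return lines


for line in summarize(sample):
    print(line)
```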

And to answer our original question from the previous tutorial, navigate to the q field box, type Spike Lee, click the Execute Query button, and there are the two films directed by Spike Lee. Excellent.
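The Query tab sends its request to the core's select handler, so the same search can be written as a URL. The sketch below builds two such URLs in Python: the free-text search typed above, and a stricter variant that names the directed_by field. The host, port and the helper select_url are assumptions for illustration. As I understand it, the free-text form relies on the catchall _text_ field we set up in Step 3.

```python
from urllib.parse import urlencode

# Assumed host and port; 8983 is Solr's default.
SELECT_URL = "http://localhost:8983/solr/films/select"


def select_url(query, rows=10):
    """Build a select URL with the q and rows parameters URL-encoded."""
    return SELECT_URL + "?" + urlencode({"q": query, "rows": rows})


# Free-text search, as typed into the q box above.
print(select_url("Spike Lee"))

# Field-specific phrase search on the directed_by field.
print(select_url('directed_by:"Spike Lee"'))
```

Pasting either URL into a browser while Solr is running should return json much like the Admin UI shows.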

Summary

I suggest playing around with search during your free time. Also, click on the Analysis tab to learn a bit about the fields that were imported.

I will leave my server as is for now, and in the next few tutorials we will focus on query functionality and field analysis. After we get comfortable with these aspects using our structured films dataset, we will build another one with unstructured data from a website crawl. That will require that we iterate through indexing, field analysis, schema design, search, and modifying each step until we are comfortable with the outcome. So stay tuned for that.

With that you now know about configuration file locations, the two schema files, and how to edit them at the command line and with the Solr Admin User Interface. We also posted documents to the core and ran our very first query.

As you can see there are many aspects to creating a useful search application with Apache Solr. If you need any help please reach out to me.


Questions and Answers

Q:  If I set up an index, can I change from a managed-schema to schema.xml later, or is it permanent?
A:  Yes, you can change this, but I would suggest sticking with the managed-schema xml file at least until you have a solid understanding of field analysis and schema design.


What's Next?


  • To see the current list of tutorials, click Outline.
  • To learn about inverted indexes in Solr, click Back.
  • To see how search queries work in Solr, click Next.
