Solr Index - Learn About Inverted Indexes and Apache Solr

How to Create an Inverted Index for Films Data in Apache Solr

Beginner

Up to this point in our Solr Tutorial series, we have built a test environment to evaluate the capabilities of Apache Solr indexing for website search and enterprise search needs. We are in standalone mode, meaning we use one computer in a test environment to learn the concepts before building a search application in a SolrCloud production environment.

In my case, Solr is installed on a Debian 8 Jessie Linux server and my connection is through SSH. Most beginners start with a Linux, macOS or Windows client machine. And while some commands may differ for Windows clients, the concepts and directory locations covered here are the same regardless of operating system.

Apache Solr in Video

Videos can also be accessed from our Apache Solr Search Playlist on YouTube (opens in a new browser window).

Solr Index - Learn about Inverted Indexes and Apache Solr (16:20)

For Those Just Starting Out

Step 1 - Define and Explore the Concept of an Inverted Index

After an installation and orientation in previous tutorials, now we will focus on the inner workings of Solr with the films dataset provided by Apache. First we need to define an inverted index and see why it is preferred over a database for search.

What is an inverted index?

How do you define an inverted index? You could start out with the Wikipedia definition, but if you are new to this you will likely get confused halfway through the first sentence, as I did. So let's keep it simple.

What is an Inverted Index? Wikipedia: "In computer science, an inverted index (also referred to as a postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its location in a database file, or in a document or set of documents (named in contrast to a Forward Index, which maps from documents to content)." - source: https://en.wikipedia.org/wiki/Inverted_index

An inverted index is like the index in the back of a book. The index points to each page on which a topic is covered, right? This offers a user the ability to quickly zoom to the desired page or pages.

This is faster than searching a database for the same reason an index in the back of a book is faster: less processing time. Using an inverted index, the processor only has to go down one column and pull out the references. A database query without a suitable index would have to check the fields in each row, which, like a human searching a book page by page, is not a very efficient plan.
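To make the book analogy concrete, here is a minimal Python sketch. It is not how Solr is implemented; the three "pages" and the function names scan_for and build_index are made up for illustration. It contrasts reading every page with looking a term up in a prebuilt index.

```python
from collections import defaultdict

# Hypothetical mini "book": page number -> text on that page.
pages = {
    1: "solr builds an inverted index",
    2: "a database stores rows and columns",
    3: "an inverted index maps terms to pages",
}

def scan_for(term, pages):
    """Page-by-page search: read every page and check each word."""
    return [num for num, text in pages.items() if term in text.split()]

def build_index(pages):
    """Back-of-the-book index: term -> sorted list of pages containing it."""
    index = defaultdict(set)
    for num, text in pages.items():
        for word in text.split():
            index[word].add(num)
    return {word: sorted(nums) for word, nums in index.items()}

index = build_index(pages)
print(scan_for("inverted", pages))  # has to read all three pages
print(index["inverted"])            # one dictionary lookup
```

Both calls return the same pages. The difference is that the index is built once, and every search afterwards is a single lookup instead of a full read.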

Step 2 - Analyze the Films Data Provided by Apache

Now with a basic understanding of inverted indexes, in Step 2 we explore how to build one with a one-file dataset on a core in standalone mode, as this offers the easiest scenario for beginners.

The Apache-provided films dataset

With Solr version 7, Apache provides example data on films that goes with Exercise 2 of their Solr Tutorial. Personally, I find their Exercise 2 difficult for beginners because they cover more intermediate topics in SolrCloud mode like multiple shards, replicas and facets.

Our tutorial is designed so beginners can evaluate whether Solr offers a doable solution for website search and enterprise search needs. That said, we all have the Apache dataset and it offers a good playground to learn Solr features, so let's stick with it.

This Apache dataset is found in the example/films directory right from where Solr was installed, in this case solr-7.0.0. The ls command in Linux and macOS systems prints a listing.

$ ls -og example/films
total 884
-rw-r--r-- 1   3829 Sep 8 12:34 film_data_generator.py
-rw-r--r-- 1 124581 Sep 8 12:34 films.csv
-rw-r--r-- 1 300955 Sep 8 12:34 films.json
-rw-r--r-- 1    299 Sep 8 12:34 films-LICENSE.txt
-rw-r--r-- 1 455444 Sep 8 12:34 films.xml
-rw-r--r-- 1   4986 Sep 8 12:34 README.txt

Among the 6 files sit 3 with the exact same data, just provided in different formats: csv, json and xml. So each file has the same list of 1,100 films and 5 fields for each film.

I should mention, we are starting with structured data here, or data with fields and values, like you may find in a database or a spreadsheet with rows and columns. You could find this type of data sitting behind the Internet Movie Database (IMDB) website or a shopping website, for example.

Unstructured data, on the other hand, would resemble text on web pages that a search engine like Google, Bing or Yandex crawls and indexes for its users to find their topic of interest. Solr can handle both and we will cover unstructured data later.

File to be indexed (input document)

So we can think of these as input documents to be put, or posted, into the core using Solr's post tool located at bin/post from the installation directory. We will do that in a later tutorial because we still have to customize a few settings first.

Below we are looking at a subset of 3 of the 1,100 films, organized as they might be in a database view with fields across the columns and films down the rows.

Each film has five fields: a unique id for that record or document, the film name, who it was directed_by, the release date and a list of one or more genres the film is classified under.

Input File Structure (Input Document)
name	id	directed_by	date	genre
.45	/en/45_2006	Gary Lennon	2006-11-30	Black comedy, Thriller
...
25th Hour	/en/25th_hour	Spike Lee	2002-12-16	Crime Fiction, Drama
...
Bamboozled	/en/bamboozled	Spike Lee	2000-10-06	Satire, Indie film, Music
...

The films data is licensed under the Creative Commons Attribution 2.5 Generic License. View the license at http://creativecommons.org/licenses/by/2.5/
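If you like to see structure in code, here is a small Python sketch that reads the three sample rows from the table above into records. The tab-separated layout mirrors the table in this tutorial and is an assumption; the films.csv file shipped with Solr may use a different delimiter or column order. The load_films name is mine.

```python
import csv
import io

# The three sample rows from the table, as tab-separated text.
# Assumption: this layout is for illustration only and may not match
# the delimiter or column order of the shipped films.csv.
SAMPLE = (
    "name\tid\tdirected_by\tdate\tgenre\n"
    ".45\t/en/45_2006\tGary Lennon\t2006-11-30\tBlack comedy, Thriller\n"
    "25th Hour\t/en/25th_hour\tSpike Lee\t2002-12-16\tCrime Fiction, Drama\n"
    "Bamboozled\t/en/bamboozled\tSpike Lee\t2000-10-06\tSatire, Indie film, Music\n"
)

def load_films(text):
    """Parse the text into one dict per film; genre becomes a list of values."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    films = []
    for row in reader:
        record = dict(row)
        record["genre"] = [g.strip() for g in record["genre"].split(",")]
        films.append(record)
    return films

films = load_films(SAMPLE)
for film in films:
    print(film["id"], "|", film["directed_by"], "|", film["genre"])
```

Note that genre is the only field holding more than one value per film, which matters later when we describe the fields to Solr.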

The output file (inverted index)

Now, we will visualize the data from the input file as if it were an inverted index and for that we need to select a column, or field, to index on. So similar to the example about the back of the book, what field should we organize the index by? If you think about it, we could use any of the last three, logically.

During this analysis phase, it is a good time to think of what a user might want to search for. Let's visualize one such case because it provides a good illustration.

Hypothetically, imagine we were given the input file but didn't have an index and couldn't use the search functionality on a computer. Let's add that our goal is to search the 1,100 films for those directed by Spike Lee. With that we have enough information to select which column to index on and what the inverted index would look like.

When we create this index on the directed_by field we are inverting the file, or turning it on its side, and indexing on that one field plus a link to the id field so we can find the document. After doing so, a search using Spike Lee in that column would be faster. Using this new inverted index, you would go straight to his name, pull the reference to each id and find the answer to what looks like 2 films.

Output File Structure (Inverted Index)
directed_by	id
Gary Lennon	/en/45_2006
...
Spike Lee	/en/25th_hour, /en/bamboozled
...
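The same inversion can be sketched in a few lines of Python. This is a simplification under stated assumptions: it indexes each whole field value exactly as in the table above, whereas a real Solr index is built from analyzed terms. The invert function name is made up for this example.

```python
from collections import defaultdict

# The three sample films, reduced to the two columns we need.
films = [
    {"id": "/en/45_2006", "directed_by": "Gary Lennon"},
    {"id": "/en/25th_hour", "directed_by": "Spike Lee"},
    {"id": "/en/bamboozled", "directed_by": "Spike Lee"},
]

def invert(records, field):
    """Map each value of `field` to the ids of the documents containing it."""
    index = defaultdict(list)
    for record in records:
        index[record[field]].append(record["id"])
    return dict(index)

by_director = invert(films, "directed_by")
print(by_director["Spike Lee"])  # ['/en/25th_hour', '/en/bamboozled']
```

The lookup goes straight to the director's name and returns the two ids, with no pass over the other films.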

Another point before we move on is that you could index on other fields as well, meaning you could index on genre and see the subset of the 1,100 films in the Drama category, for example.
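Indexing on genre differs slightly because one film can carry several genres. A minimal Python sketch, again using only the three sample films and a made-up function name, adds each film under every genre it lists:

```python
from collections import defaultdict

# Sample films with genre already split into a list of values.
films = [
    {"id": "/en/45_2006", "genre": ["Black comedy", "Thriller"]},
    {"id": "/en/25th_hour", "genre": ["Crime Fiction", "Drama"]},
    {"id": "/en/bamboozled", "genre": ["Satire", "Indie film", "Music"]},
]

def invert_multivalued(records, field):
    """Map each individual value in a list-valued field to document ids."""
    index = defaultdict(list)
    for record in records:
        for value in record[field]:
            index[value].append(record["id"])
    return dict(index)

by_genre = invert_multivalued(films, "genre")
print(by_genre["Drama"])  # ['/en/25th_hour']
```

With the full dataset the Drama entry would hold many more ids; the sample only contains one Drama film.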

Step 3 - The Inputs Required to Build an Index in Solr

For Step 3, now that we have a very simplified inverted index we need to learn how to build a core. As mentioned, in this base case we will use a Solr core in standalone mode but first we will start a server instance and discuss configuration files.

Start the Solr server

In the last tutorial we walked through starting a server instance with the bin/solr start command at the Linux or macOS command line. In Windows you would use bin\solr.cmd start.

$ bin/solr start
Waiting up to 180 seconds to see solr running on port 8983 [\]
Started Solr server on port 8983 (pid=5700). Happy searching!

Access the Solr Admin User Interface

Keep in mind there are multiple ways to build a core using the command line script, but before doing so I find it helpful to look at the five input fields for a core in the Solr Admin User Interface.

We covered this in the previous tutorial. To open it we point the browser to the server using http://<IP address>:<port>/solr, or http://localhost:<port>/solr if you are using a local installation.

If you click on Core Admin or No cores available it shows the five inputs or defaults Solr requires to build a core. Let me explain the process a bit first and then we will create the core from the command line in Step 4.

name - First is the name field. This will be the name of the core going forward. By default, in the background Solr will create a home directory with that name in the server/solr/ directory.
instanceDir - Second, if we want a different name for that directory we would name it in the instanceDir field. However, at the early stages I would not recommend doing so.
dataDir - Third, with dataDir we could use a different name for the directory where Solr stores its index files, but again it is better at this stage to accept the defaults. The resulting directory will look like this: server/solr/<name>/data.
config - Fourth, the solrconfig.xml file is one of several configuration files we will explore later. It covers high-level settings about indexing, administering the core and responding to search queries. It sits inside the server/solr/<name>/conf/ directory.
schema - Fifth, schema.xml is a file that describes the fields in the documents you post to the core, and it needs to be heavily customized for structured data. For unstructured data you can have Solr automatically interpret the fields using what is called a schemaless configuration. Behind the scenes Solr will copy over a basic configuration file to start with and as Solr analyzes the input documents it will modify the schema on its own.
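The five inputs map onto parameters of the request that creates a core. Here is a hedged Python sketch that only builds the URL and does not contact a server. The parameter names come from the form fields listed above, the host and port come from this tutorial's examples, and core_create_url is a hypothetical helper, not part of Solr.

```python
from urllib.parse import urlencode

# Optional inputs from the Core Admin form that may override the defaults.
OPTIONAL_INPUTS = {"instanceDir", "dataDir", "config", "schema"}

def core_create_url(name, base="http://localhost:8983/solr", **overrides):
    """Build a Core Admin CREATE URL; by default the directory is named after the core."""
    params = {"action": "CREATE", "name": name, "instanceDir": name}
    for key, value in overrides.items():
        if key not in OPTIONAL_INPUTS:
            raise ValueError(f"unknown core input: {key}")
        params[key] = value
    return f"{base}/admin/cores?{urlencode(params)}"

# Accept all defaults, as recommended at this stage.
print(core_create_url("films"))
```

Keep in mind this is only the request. The configuration files still have to exist in the core's conf directory, which is one reason we will use the command line script in Step 4 and let it copy them for us.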

While we are here let me mention that editing schema is one of the more complicated aspects of Solr. It is one thing to create an index in a local environment that helps you search your own personal documents.

Creating an index that will be used in a web application requires a completely different level of sophistication. In this use case, users may type anything in the search box and will expect a good answer, so finely tuning configuration files for a production environment is where you will spend most of your time.

If we dive into editing schema too early I may lose you. That said, for the films dataset to work properly we will need to make two modifications in the next tutorial. These modifications are also documented in the Apache Solr 7 Tutorial for Exercise 2 if you want to get ahead.

Step 4 - Create a Core Named films

Now that we know what happens behind the scenes, for Step 4 we need to shift our focus to the command line because some aspects of Solr administration can only be accomplished there.

Using the bin/solr -help command, let's explore its capabilities by examining the first three lines about usage.

$ bin/solr -help
Usage: solr COMMAND OPTIONS
       where COMMAND is one of: start, stop, restart, status, healthcheck, create, create_core, create_collection, delete, version, zk, auth

This lists the 12 commands within the bin/solr script and we can append -help after each command to learn about each one individually.

To create a core you can use either bin/solr create or bin/solr create_core. I prefer the latter because its help output makes the necessary parameters easier to find. I suggest reviewing bin/solr create_core -help on your end and you will see that it requires only one option, -c <name>. This will accept all other defaults and copy over the configuration files from the _default conf directory, as touched on earlier.

$ bin/solr create_core -c films
WARNING: Using _default configset. Data driven schema functionality is enabled by default, which is NOT RECOMMENDED for production use.
To turn it off:
curl http://localhost:8983/solr/films/config -d '{"set-user-property": {"update.autoCreateFields":"false"}}'
Copying configuration to new core instance directory:
/home/paul/solr-7.0.0/server/solr/films
Creating new core 'films' using command:
http://localhost:8983/solr/admin/cores?action=CREATE&name=films&instanceDir=films
{
  "responseHeader":{
    "status":0,
    "QTime":1222},
  "core":"films"}
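The curl command in that warning can also be expressed in Python if curl is not available on your machine. This sketch builds the request without sending it. The URL and JSON body are taken from the warning text, while the function name and the explicit Content-Type header are my own choices.

```python
import json
import urllib.request

def disable_auto_create_request(core, base="http://localhost:8983/solr"):
    """Build (but do not send) the request that turns off automatic field creation."""
    body = {"set-user-property": {"update.autoCreateFields": "false"}}
    return urllib.request.Request(
        url=f"{base}/{core}/config",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption: curl -d does not set this
        method="POST",
    )

req = disable_auto_create_request("films")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To send it against a running server, uncomment:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

We do not need to run this now. It only matters once you decide to manage the schema by hand.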

Okay, the core is set up from the command line, but remember we have yet to post any documents to it, and will save that for later.

There are four points I would like to make here before moving on. First, notice the warning at the start. This goes to the point I made earlier about how the default configurations may be good enough for a local environment, but in a production environment you will have to become intimately familiar with these settings, and I'll give you a peek at one of them in a minute.

Second is a note about how Solr automatically creates fields for you when it ingests data, and instructions for how to turn that feature off.

Third is the confirmation that the new directory was created with the name films. Fourth is a section showing commands that were sent to, and received from, the Solr server.
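If you script core creation, the response from the server is worth checking rather than eyeballing. A minimal Python sketch, assuming the response has the shape printed by create_core (a status of 0 in responseHeader plus the core name); core_created is a hypothetical helper.

```python
import json

# Response text in the shape printed by create_core in this tutorial.
RESPONSE = '{"responseHeader": {"status": 0, "QTime": 1222}, "core": "films"}'

def core_created(response_text, expected_core):
    """Return True when the status is 0 and the expected core name is echoed back."""
    payload = json.loads(response_text)
    header = payload.get("responseHeader", {})
    return header.get("status") == 0 and payload.get("core") == expected_core

print(core_created(RESPONSE, "films"))  # True
```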

Step 5 - Examine the Directory Structure and Customizations

Now that the Solr server is up and we have a core, for Step 5, let's head back to the Solr Admin User Interface to see what changed and observe those default configuration files.

When we reload the page, or go back to the Dashboard, the dropdown for the Core Selector now has films as an option. Select that and we have 11 sections. This Overview page details the number of documents in the core, zero at this point, and directories related to this server instance, where data is kept, and where the raw index files will be located.
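The document count can also be read without the browser through the Core Admin STATUS action. The Python sketch below builds that URL and pulls the count out of a response. The sample payload is hand-written and trimmed to the keys I would expect, not captured from a server, so treat the exact shape as an assumption and compare it with what your installation returns.

```python
import json
from urllib.parse import urlencode

def status_url(core, base="http://localhost:8983/solr"):
    """URL for the Core Admin STATUS action on one core."""
    return f"{base}/admin/cores?{urlencode({'action': 'STATUS', 'core': core})}"

def num_docs(status_payload, core):
    """Read the document count for `core` from a parsed STATUS response."""
    return status_payload["status"][core]["index"]["numDocs"]

# Assumption: hand-written, trimmed sample of a STATUS response for an empty core.
sample = json.loads(
    '{"responseHeader": {"status": 0},'
    ' "status": {"films": {"name": "films", "index": {"numDocs": 0}}}}'
)

print(status_url("films"))
print(num_docs(sample, "films"))
```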

Please review the other 10 sections on your own time. Much of this will not make sense yet, but it offers a good way to familiarize yourself with Solr capabilities.

I think it is worthwhile to click on 3 of them now. First, under Analysis is information on fields.

Second, Files lists all of the language files used to process text in different languages. Files like stopwords, synonyms and protected words to me are the guts of how we make a useful search tool. Here you can also view the long and complex configuration files I warned you about earlier. And if you know me, I used to be a quantitative equity portfolio manager, and what gets me excited is the process of optimizing systems to work the way we want them to. If you stick around I will help you navigate these details.

Third, under Query is a tool that helps you construct search queries, but we need to get data in there first, which we will do after we edit configuration files in the next tutorial.

So on my end I will leave this core running until the next tutorial. If you want to stop yours and come back to it you can use the bin/solr stop command.

Summary

With that, you now know about inverted indexes, our films input files, the 5 possible configuration options when creating a new core and the defaults if you simply create one with a name. We also know where we are heading next with configurations. After that we will post data to the core and perform a basic query, because we still want to see which films Spike Lee directed.

As you can see there are many aspects to creating a useful search application with Apache Solr. If you need help for your particular situation please feel free to reach out with a direct message on social media.

Related Solr Reference Material

Solr Reference Outline

Questions and Answers

Q: Why is it that when looking at the files in the /server/solr/films/conf directory schema.xml is missing?
A: That answer requires a full explanation so please see the next tutorial.
