/ factorpad.com / tech / solr / tutorial / solr-index.html
Beginner
Up to this point in our Solr Tutorial series we built a test environment to evaluate the capabilities of Apache Solr indexing for website search and enterprise search needs. We are in standalone mode, meaning we use one computer in a test environment to learn the concepts before building a search application in a SolrCloud production environment.
In my case, Solr is installed on a Debian 8 Jessie Linux server and my connection is through SSH. Most beginners start with a Linux, macOS or Windows client machine. And while some commands may differ for Windows clients, the concepts and directory locations covered here are the same regardless of operating system.
Videos can also be accessed from our Apache Solr Search Playlist on YouTube (opens in a new browser window).
Solr Index - Learn about Inverted Indexes and Apache Solr (16:20)
After an installation and orientation in previous tutorials, now we will focus on the inner workings of Solr with the films dataset provided by Apache. First we need to define an inverted index and see why it is preferred over a database for search.
How do you define an inverted index? You could start out with the Wikipedia definition but if you are new to this, you will likely get confused half way through the first sentence, as I did. So let's keep it simple.
An inverted index is like the index in the back of a book. The index points to each page on which a topic is covered, right? This offers a user the ability to quickly zoom to the desired page or pages.
This is faster than searching a database for the same reason an index in the back of a book is faster: less processing time. Using an inverted index, the processor only has to go down one column and pull out the references. A database without a suitable index would search all fields in each row, which, like a human searching a book page by page, is not a very efficient plan.
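To make the book analogy concrete, here is a minimal Python sketch. It is only an illustration of the idea, not how Solr implements its index, and the three "pages" of text are made up for the example. It compares a page-by-page scan with a lookup in a small inverted index.

```python
# Toy "book": page number -> text on that page (made-up content).
pages = {
    1: "solr builds an inverted index",
    2: "a database stores rows and columns",
    3: "an inverted index maps terms to pages",
}


def scan_for(term):
    """Page-by-page search: read every page and check its words."""
    return [number for number, text in pages.items() if term in text.split()]


def build_index():
    """Back-of-the-book index: each term points to the pages containing it."""
    index = {}
    for number, text in pages.items():
        for term in set(text.split()):
            index.setdefault(term, []).append(number)
    return index


index = build_index()
print(scan_for("inverted"))  # has to read all three pages
print(index["inverted"])     # a single dictionary lookup
```

Both calls return the same pages, but the scan touches every page on every search, while the index does that work once up front.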
Now with a basic understanding of inverted indexes, in Step 2 we explore how to build one with a one-file dataset on a core in standalone mode, as this offers the easiest scenario for beginners.
With Solr version 7, Apache provides example data on films that goes with Exercise 2 of their Solr Tutorial. Personally, I find their Exercise 2 difficult for beginners because it covers more intermediate topics in SolrCloud mode like multiple shards, replicas and facets.
Our tutorial is designed so beginners can evaluate whether Solr offers a doable solution for website search and enterprise search needs. That said, we all have the Apache dataset and it offers a good playground to learn Solr features, so let's stick with it.
This Apache dataset is found in the example/films directory, directly under the directory where Solr was installed, in this case solr-7.0.0. The ls command in Linux and macOS systems prints a listing.
Among the 6 files sit 3 with the exact same data, just provided in different formats: csv, json and xml. So each file has the same list of 1,100 films and 5 fields for each film.
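As a rough illustration of "same data, different formats", the Python sketch below parses one film from a CSV string and from a JSON string and checks that both yield the same record. The two snippets are hand-written for this example. The column order, the short field name date and the pipe separator for multiple genres are assumptions here, so check the actual files in example/films before relying on them.

```python
import csv
import io
import json

# Hand-written sample records (assumed layout, not copied from the Apache files).
csv_text = (
    "name,id,directed_by,date,genre\n"
    ".45,/en/45_2006,Gary Lennon,2006-11-30,Black comedy|Thriller\n"
)
json_text = (
    '[{"name": ".45", "id": "/en/45_2006", "directed_by": "Gary Lennon", '
    '"date": "2006-11-30", "genre": ["Black comedy", "Thriller"]}]'
)


def from_csv(text):
    """Parse CSV rows into dicts, splitting the multi-valued genre field."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        record = dict(row)
        record["genre"] = record["genre"].split("|")
        records.append(record)
    return records


def from_json(text):
    """Parse a JSON array of film objects into a list of dicts."""
    return json.loads(text)


# Both formats describe the same film with the same five fields.
print(from_csv(csv_text) == from_json(json_text))
```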
I should mention, we are starting with structured data here, or data with fields and values, like you may find in a database or a spreadsheet with rows and columns. You could find this type of data sitting behind the Internet Movie Database (IMDB) website or a shopping website, for example.
Unstructured data, on the other hand, would resemble text on web pages that a search engine like Google, Bing or Yandex crawls and indexes for its users to find their topic of interest. Solr can handle both and we will cover unstructured data later.
So we can think of these as input documents to be put, or posted, into the core using Solr's post tool located at bin/post from the installation directory. We will do that in a later tutorial because we still have to customize a few settings first.
We are looking at a subset of 3 of the 1,100 films organized as it might be in a database view with fields across the columns and films down the rows.
Each film has five fields: a unique id for that record or document, the film name, who it was directed_by, the release date and a list of one or more genres the film is classified under.
name | id | directed_by | date | genre |
---|---|---|---|---|
.45 | /en/45_2006 | Gary Lennon | 2006-11-30 | Black comedy, Thriller |
... | | | | |
25th Hour | /en/25th_hour | Spike Lee | 2002-12-16 | Crime Fiction, Drama |
... | | | | |
Bamboozled | /en/bamboozled | Spike Lee | 2000-10-06 | Satire, Indie film, Music |
... | | | | |
Now, we will visualize the data from the input file as if it were an inverted index and for that we need to select a column, or field, to index on. So similar to the example about the back of the book, what field should we organize the index by? If you think about it, we could use any of the last three, logically.
During this analysis phase, it is a good time to think of what a user might want to search for. Let's visualize one such case because it provides a good illustration.
Hypothetically, imagine we were given the input file but didn't have an index and couldn't use the search functionality on a computer. Let's add that our goal is to search the 1,100 films for those directed by Spike Lee. With that we have enough information to select which column to index on and what the inverted index would look like.
When we create this index on the directed_by field we are inverting the file, or turning it on its side, and indexing on that one field plus a link to the id field so we can find the document. After doing so, a search using Spike Lee in that column would be faster. Using this new inverted index, you would go straight to his name, pull the reference to each id and find the answer to what looks like 2 films.
directed_by | id |
---|---|
Gary Lennon | /en/45_2006 |
... | |
Spike Lee | /en/25th_hour, /en/bamboozled |
... | |
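The inversion shown in the table can be sketched in a few lines of Python. This is a simplified illustration using only the three films shown above, not the way Solr stores its index, and the function name invert_on is made up for this example.

```python
# The three sample films from the table above (subset of fields).
films = [
    {"id": "/en/45_2006", "name": ".45", "directed_by": "Gary Lennon"},
    {"id": "/en/25th_hour", "name": "25th Hour", "directed_by": "Spike Lee"},
    {"id": "/en/bamboozled", "name": "Bamboozled", "directed_by": "Spike Lee"},
]


def invert_on(field, documents):
    """Map each value of `field` to the ids of the documents that contain it."""
    index = {}
    for doc in documents:
        index.setdefault(doc[field], []).append(doc["id"])
    return index


by_director = invert_on("directed_by", films)
print(by_director["Spike Lee"])  # ['/en/25th_hour', '/en/bamboozled']
```

Once the index is built, finding Spike Lee's films is a single lookup followed by reading off the ids.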
Another point before we move on is that you could index on other fields as well. For example, you could index on genre and see the subset of the 1,100 films in the Drama category.
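Indexing on genre differs slightly because one film can carry several genres, so each film may appear under more than one entry. Here is a small Python sketch under that assumption, again using only the three sample films. The helper name invert_on_multivalued is hypothetical.

```python
# The three sample films, with genre held as a list of values.
films = [
    {"id": "/en/45_2006", "genre": ["Black comedy", "Thriller"]},
    {"id": "/en/25th_hour", "genre": ["Crime Fiction", "Drama"]},
    {"id": "/en/bamboozled", "genre": ["Satire", "Indie film", "Music"]},
]


def invert_on_multivalued(field, documents):
    """Map every value in a list-valued field to the ids of matching documents."""
    index = {}
    for doc in documents:
        for value in doc[field]:
            index.setdefault(value, []).append(doc["id"])
    return index


by_genre = invert_on_multivalued("genre", films)
print(by_genre["Drama"])  # ['/en/25th_hour']
print(len(by_genre))      # 7 distinct genres across the three films
```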
For Step 3, now that we have a very simplified inverted index we need to learn how to build a core. As mentioned, in this base case we will use a Solr core in standalone mode but first we will start a server instance and discuss configuration files.
In the last tutorial we walked through starting a server instance with the bin/solr start command at the Linux or macOS command line. In Windows you would use bin\solr.cmd start.
Keep in mind there are multiple ways to build a core using the command line script but before doing so, I find it helpful to visualize a core's five required fields in the Solr Admin User Interface first.
We covered this in the previous tutorial. To open it we point the browser to the server using http://<IP address>:<port>/solr, or http://localhost:<port>/solr if you are using a local installation.
If you click on Core Admin or No cores available it shows the five inputs or defaults Solr requires to build a core. Let me explain the process a bit first and then we will create the core from the command line in Step 4.
The core's own directory is created under the server/solr/ directory, its index data is kept in server/solr/<name>/data, and its configuration files sit in the server/solr/<name>/conf/ directory.
While we are here let me mention that editing schema is one of the more complicated aspects of Solr. It is one thing to create an index in a local environment that helps you search your own personal documents.
Creating an index that will be used in a web application requires a completely different level of sophistication. In this use case, users may type anything in the search box and will expect a good answer, so finely tuning configuration files for a production environment is where you will spend most of your time.
If we dive into editing schema too early I may lose you. That said, for the films dataset to work properly we will need to make two modifications in the next tutorial. These modifications are also documented in the Apache Solr 7 Tutorial for Exercise 2 if you want to get ahead.
Now that we know what happens behind the scenes, for Step 4 we need to shift our focus to the command line because some aspects of Solr administration can only be accomplished there.
Using the bin/solr -help command, let's explore its capabilities by examining the first three lines about usage.
This lists the 12 commands within the bin/solr script and we can append -help after each command to learn about each one individually.
To create a core you can use either bin/solr create or bin/solr create_core. I prefer the latter because using it to find the necessary parameters with help is easier. I suggest reviewing bin/solr create_core -help on your end and you will see that it requires only one option, -c <name>. This will accept all other defaults, and copy over the configuration files from the _default conf directory, as touched on earlier.
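If you prefer to script this step, the command can be wrapped in a few lines of Python. This is only a sketch under some assumptions: the helper names are made up, the script is assumed to run from the Solr installation directory on Linux or macOS, and on Windows the script name would be bin\solr.cmd instead. Nothing is executed until create_core is called, and that call needs a running Solr server.

```python
import subprocess


def create_core_command(name, solr_script="bin/solr"):
    """Return the argument list that creates a core with the default settings."""
    return [solr_script, "create_core", "-c", name]


def create_core(name, solr_home="."):
    """Run the command from the Solr installation directory (requires Solr)."""
    return subprocess.run(create_core_command(name), cwd=solr_home, check=True)


# Show the command that would be run for the films core.
print(" ".join(create_core_command("films")))  # bin/solr create_core -c films
```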
Okay, the core is set up from the command line, but remember we have yet to post any documents to it, and will save that for later.
There are four points I would like to make here before moving on. First, notice the warning at the start. This is to the point I made earlier about how the default configurations may be good enough for a local environment but in a production environment you will have to become intimately familiar with these settings, and I'll give you a peek at one of them in a minute.
Second is a note about how Solr automatically creates fields for you when it ingests data, and instructions for how to turn that feature off.
Third is the confirmation that the new directory was created with the name films. Fourth is a section showing commands that were sent to, and received from, the Solr server.
Now that the Solr server is up and we have a core, for Step 5, let's head back to the Solr Admin User Interface to see what changed and observe those default configuration files.
When we reload the page, or go back to the Dashboard, the dropdown for the Core Selector now has films as an option. Selecting it reveals 11 sections. The Overview page details the number of documents in the core, zero at this point, and directories related to this server instance, where data is kept, and where the raw index files will be located.
Please review the other 10 sections on your own time. Much of this will not make sense yet, but it offers a good way to familiarize yourself with Solr capabilities.
I think it is worthwhile to click on 3 of them now. First, under Analysis is information on fields.
Second, Files lists all of the language files used to process text in different languages. Files like stopwords, synonyms and protected words to me are the guts of how we make a useful search tool. Here you can also view the long and complex configuration files I warned you about earlier. And if you know me, I used to be a quantitative equity portfolio manager, and what gets me excited is the process of optimizing systems to work the way we want them to. If you stick around I will help you navigate these details.
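To give a feel for why stopwords and synonyms matter, here is a toy Python sketch. It is not Solr's analysis chain and does not read Solr's files; the word lists are made up for the example. It only shows the general idea of dropping noise words and expanding a term with its synonyms.

```python
# Made-up sample lists; Solr's real files live in the core's conf directory.
STOPWORDS = {"a", "an", "the", "of"}
SYNONYMS = {"movie": ["film"], "film": ["movie"]}


def analyze(text):
    """Lowercase the text, drop stopwords, and add synonyms after each token."""
    tokens = []
    for word in text.lower().split():
        if word in STOPWORDS:
            continue
        tokens.append(word)
        tokens.extend(SYNONYMS.get(word, []))
    return tokens


print(analyze("The movie of the year"))  # ['movie', 'film', 'year']
```

With this kind of processing, a search for "film" can match a document that only says "movie", which is the sort of tuning these files control.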
Third, under Query is a tool that helps you construct search queries, but we need to get data in there first, which we will do after we edit configuration files in the next tutorial.
So on my end I will leave this core running until the next tutorial. If you want to stop yours and come back to it you can use the bin/solr stop command.
With that, you now know about inverted indexes, our films input files, 5 possible configuration options when creating a new core and the defaults if you simply create one with a name. We also know where we are heading next with configurations. After that we will post data to the core and perform a basic query, because we still want to see which films Spike Lee directed.
As you can see there are many aspects to creating a useful search application with Apache Solr. If you need help for your particular situation please feel free to reach out with a direct message on social media.
Q: Why is it that when looking at the files in the server/solr/films/conf directory, schema.xml is missing?
A: That answer requires a full explanation, so please see the next tutorial.