For many of us, search is where it's at. In 2017, scores of developers are evaluating Solr for website search as a possible replacement for Google Site Search, Google Custom Search or alternatives like Elasticsearch or Amazon CloudSearch. To make a thorough evaluation of Apache Solr search, many of us build out a test environment to learn the functionality before moving on to a development environment. This tutorial will go a long way toward helping you make a more informed choice.
In previous tutorials, we have taken steps to get a dataset loaded and ready to field queries in the Apache Solr User Interface in a browser. It is from here that we can start to see an application take shape. Also, this is where the mechanical part of getting the systems and data ready is replaced by search analytics, and is where, for many, the fun begins.
Here we will pick up where we left off in the last tutorial. From there we will walk through the process of search in Solr behind the scenes and get a better understanding of configuration files impacting results. If you have yet to get your system set up with the films dataset, I suggest going back a few tutorials to get caught up.
With that, let's get in and have some fun with Apache Solr search.
Solr Search - The Solr Query Process and How to Interpret Output (21:07)
Videos can also be accessed from our Apache Solr Search Playlist on YouTube (opens in a new browser window).
Moving on to Step 1, I want to remind you how we got here. We asked our index to return films, and our search term in the q box was Spike Lee, with the goal of finding which films he directed.
A quick word about where we left off with our films core and index. Clicking on Overview shows that it contains those 1,100 documents and provides particulars about the server Instance, Data and Index directory locations. In our case those documents relate to records in an xml file we posted to the films core in the last tutorial.
After we posted it, we clicked on Query and performed two searches. In the first, we used the default of *:* in the q box, which selects all records, so hitting Execute Query shows the output in json format. The section under common lists what are called search parameters. The first and only parameter we used was q, the box for search terms, and we gave it *:* to return all records.
Now, shift your focus to the right side of the Query tab. At the top is the request URL that, when used in a browser, will query the Solr server and bring up these results. So a website search application using this would provide results for you to render as you see fit. Clicking on this link will kick that query right to a web page if you want to try that out. The format looks like this:
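As a rough sketch, the request URL follows the pattern host:port/solr/core/select?parameters. Assuming the defaults used in this series (localhost, port 8983, a core named films), Python's standard library can assemble one:

```python
from urllib.parse import urlencode

# Assumed defaults from this series: local Solr on port 8983, a core named "films"
base = "http://localhost:8983/solr/films/select"
params = {"q": "*:*"}  # the match-all query

url = base + "?" + urlencode(params)  # urlencode percent-encodes the parameters
print(url)
```

Note that urlencode percent-encodes the special characters, which is why a URL copied from the Admin UI may look slightly different from what you typed into the q box.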
With the default settings, the output has at least two sections. First is a responseHeader, which provides basic information about the request: status, where 0 indicates no errors; QTime for the query time; and params, listing the parameters of the query.
The next section is called the response, and it provides summary information like numFound for the total of 1,100 documents. The start of 0 requests that the output begin at the first record, and rows asks for 10 of them, which is what we got rather than a dump of all 1,100. These defaults, like almost everything in Solr, can be modified.
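To make that response shape concrete, here is a minimal sketch that reads those two sections in Python. The JSON literal is hand-written to mirror the films example, illustrative only, not captured Solr output:

```python
import json

# Hand-written sample shaped like a Solr select response (illustrative only)
raw = """{
  "responseHeader": {"status": 0, "QTime": 5,
                     "params": {"q": "*:*"}},
  "response": {"numFound": 1100, "start": 0,
               "docs": [{"id": "1", "name": ".45"}]}
}"""

data = json.loads(raw)
print(data["responseHeader"]["status"])   # 0 means no errors
print(data["response"]["numFound"])       # total matches: 1100
print(len(data["response"]["docs"]))      # rows actually returned in this page
```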
Next, we have the section for the first record, a film called .45, as indicated by the field called name. Remember we imported documents with 5 fields: id for the unique id, name for the film title, directed_by for the one or several directors of the film, the initial_release_date and genre for the one or several genres the film was classified under.
And because this was a "Schemaless" configuration, Solr created three fields on its own: _version_, genre_str and directed_by_str, which are not really relevant to us now. The films data is licensed under the Creative Commons Attribution 2.5 Generic License. View the license at http://creativecommons.org/licenses/by/2.5/
In our second query, our goal was to find movies directed by Spike Lee by inputting his name in the q parameter. Clicking on Execute Query returns 11 documents this time, as identified by numFound. Does that mean he directed 11 films here? Let's see. The first one, 25th Hour, was directed by Spike Lee. The second, Bamboozled, was as well. The third, Adaptation, was directed by Spike Jonze. Interesting, so not a match, but this is search, and it identified the first name as a match on "Spike". What is going on with the fourth? It was directed by Lee Sang-il, so a match on "Lee". You get the point: Solr is finding matches, just not exact matches. This is an open-ended search, so with these parameters we shouldn't expect exact matches; we just want it to show relevant information. If we wanted an exact match, we could modify this query.
So similar to a search engine, it is returning documents near the top that are more relevant to your request, so that's good. Later we will see numeric scores on each of these films.
For Step 2, we will cover the 3-part workflow that goes on behind the scenes for every search query in Solr.
The Request Handler is a plug-in that organizes requests. The /select? in the URL points to the select Request Handler; another, /update?, provides functionality to update an index, among others. From the Solr Admin UI, the Plugins / Stats tab lists several installed by default.
The Query Parser comes next and it interprets the parameters selected and the query terms, or what you are searching for, and how the search is performed. Then it sends those requests to the index. In Solr 7 there are three widely-used alternatives.
There are about 25 other parsers available for special needs, offering flexibility to create fields on the fly, give more weight to some fields and even handle geospatial queries, used for finding the nearest coffee shop, for example. Keep in mind that different parsers require different parameters, so what you see on the Query tab are the parameters common across all Query Parsers, which we cover in Step 4.
After the Query Parser submits requests to the index, additional transformations occur before results are returned. These are performed by the Response Writer. Examples include how much data to include, which additional filters or groupings to apply and, finally, which format to present the data in.
From the Query tab, if you click on the dropdown marked wt you will see the 6 most common Response Writers of the 21 available. This customizes the output for the data's next destination. For example, if you will perform additional processing in Python, the output can be customized for Python. The default is json.
For Step 3, let's look at the schema created automatically by Solr. If you recall, we didn't spend any time analyzing fields; instead we jumped right in to our first query. So let's do that analysis here.
A schema is an xml file that tells Solr how to ingest documents into the core, process them into fields and spit out an index we hope is usable for our audience. In our films case, with a "schemaless" configuration, Solr automatically interpreted rules for field types, meaning text or numeric, by default. It also has rules for how to process punctuation, capitalized words and web addresses. Solr will also create fields on the fly that combine other fields together to aid in search, using what is called a Copy Field, and we added one in the last tutorial.
Before we look at the schema, we should bring back two points. First, the xml schema files can be managed by hand, or with the help of two Solr tools: the Solr Admin UI or the Solr Schema API from the command line.
Second, recall from the last tutorial that the method for editing the schema dictates the name of the schema file. If it is hand-edited, it is schema.xml. If it is managed using the tools, to prevent us from making mistakes, it is called managed-schema. The "schemaless" configuration dictated that we use the latter, so let's look at that. The managed-schema file is located in the conf/ directory within the home of the core.
Let's look at the installation directory first, using a pwd followed by a listing of its contents. I usually keep this installation directory as my working directory so it is easy to access the script straight from here. If this looks confusing, don't worry; you will quickly memorize all of the file locations. Also, within the server directory sits the films core and all of its settings and data.
The managed-schema file is quite long, about 500 lines. So instead of confusing ourselves by opening the whole thing up and searching for the fields, I find it easier to use the Solr Schema API from the command line to pull out just the parts we need. Let's try the first one and focus on the fields Solr created when we posted documents to it.
Okay, so this is the list of fields in json format; let's make a few observations. First, the output here also has a responseHeader, and the second section shows the fields. What we are interested in at this point are the five imported fields, all at the bottom of the output.
|directed_by||directed_by was given a field type of "text_general" which seems logical.|
|genre||genre is also a "text_general" field.|
|id||the id field has more going on. It was assigned a "string" field type, and it is not "multiValued", meaning each record has a unique id. It is "indexed", "required" and "stored". So it has all of the characteristics of a unique identification number.|
|initial_release_date||the initial_release_date field was assigned the type "pdates" for a date field.|
|name||the "name" field, if you recall from the last tutorial, is the one we assigned to "text_general" when we edited the schema using the Solr Admin UI. We did this because the name of the first film, .45, would otherwise have led Solr to guess a numeric field type.|
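On a default local install, these field definitions can be pulled from the Schema API with a URL like http://localhost:8983/solr/films/schema/fields (host and port are the assumed defaults from this series). Here is a small Python sketch of reading that output; the JSON literal is hand-written to mirror the table above, not captured from a live server:

```python
import json

# Hand-written sample shaped like the Schema API's fields output;
# values mirror the table above and are illustrative only
raw = """{
  "responseHeader": {"status": 0},
  "fields": [
    {"name": "directed_by", "type": "text_general"},
    {"name": "genre", "type": "text_general"},
    {"name": "id", "type": "string", "multiValued": false,
     "indexed": true, "required": true, "stored": true},
    {"name": "initial_release_date", "type": "pdates"},
    {"name": "name", "type": "text_general"}
  ]
}"""

# Index the field definitions by name for easy lookup
fields = {f["name"]: f for f in json.loads(raw)["fields"]}
print(fields["id"]["type"])          # string
print(fields["id"]["multiValued"])   # False: each record has a unique id
```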
In our second modification we set up a Copy Field. We can use the Schema API to pull these as well.
So the field "_text_" was the one we created, and the other two Solr created on its own. This explains the extra fields we saw in the output earlier.
The point here was to circle back and see what Solr did behind the scenes. We mentioned that the "schemaless" configuration is not meant for production, but helps to get started quickly. For production, we would want to fine-tune each of the fields and field types. This topic will take on more meaning down the road when we use unstructured data, like you might see in a more traditional website search application.
Now for Step 4, let's walk through the default query parameters common across Query Parsers. For that, we will keep it simple by using the Solr Admin UI.
|defType||This selects the query parser, with the three most common being 1) lucene, 2) dismax and 3) edismax.||lucene|
|q||This is where you enter search terms.||*:*|
|fq||This is used to set a filter query which creates a cache of potential sub-queries. So if your user will look at more granular information it helps with speed to identify the fields ahead of time so the results are ready in cache.||none|
|sort||Here you enter how you would like results sorted, with common options being asc or desc for ascending and descending.||desc|
|start,rows||The start,rows parameters are used like a search in Google that provides the top 10 results by default and allows you to resubmit the query to find the next 10 results. So think of it as a way to paginate query results. The 0 starts at the first record and 10 shows 10 records, or rows.||0,10|
|fl||The field list parameter limits the results to a specific list of fields. In order for them to show up the schema must have one of two settings for the field: 1) stored="true" or 2) docValues="true". Multiple fields can be selected and are separated by commas or spaces. You can also return the score given as a measure of the relevance of the search results. A * shows all stored fields.||*|
|df||In the default field parameter you could enter a specific field you would like Solr to search. In our case, the default search field is the Copy Field we created called _text_.||_text_|
|Raw Query Parameters||The Raw Query Parameters section is for advanced use, like for query debugging.||none|
|wt||The wt parameter selects one of 6 popular Response Writers with output customized for 1) json, 2) xml, 3) python, 4) ruby, 5) php, 6) csv. There are another 16 that can be input from the command line.||json|
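Putting several of these parameters together, here is a sketch of the kind of request URL the Query tab assembles behind the scenes (host, port and core are the defaults assumed throughout this series):

```python
from urllib.parse import urlencode

# Common query parameters, mirroring the table above
params = {
    "q": "Spike Lee",      # search terms
    "rows": 5,             # page size
    "start": 0,            # offset, for pagination
    "fl": "name,score",    # field list, plus the relevancy score
    "sort": "score desc",  # highest relevancy first
    "wt": "json",          # Response Writer
}
url = "http://localhost:8983/solr/films/select?" + urlencode(params)
print(url)
```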
The checkboxes are fairly self-explanatory, with indent and debugQuery referring to the visual format and items returned from the query. You can also select the dismax and edismax Query Parsers here. In addition, there are checkboxes for specific functions: highlighting (hl) search text, faceting (facet) to group results in buckets, geospatial (spatial) results for navigational search and spellchecking (spellcheck) capabilities.
The best way to learn this is with practice, of course.
In Step 5, we will modify search parameters in the Solr Admin UI, and after that we will try one straight from the command line.
For our first example, let's type Spike Lee in the q parameter again. We want only the first 5 results, so type 5 in the rows parameter box. By default they will be returned in descending order based on a relevancy score for the search term. This time let's view the film name and the score by entering name,score in the fl parameter. Clicking on the Execute Query button presents these in the default json format.
So to summarize, we see the same number of documents found as before, 11. This time we have a maxScore of 11.271416. We will cover document relevancy and scoring in future tutorials, but it is helpful to see by how far the two films Spike Lee actually directed are separated from those in which only "Spike" or "Lee" appeared in the directed_by field in our original dataset.
In our second example, we will add the directed_by field to the fl parameter and use the wt dropdown to select the csv output.
This output is nice for those who are comfortable with a spreadsheet-type format, like those of us with finance and statistics backgrounds, so you could easily kick data like this into Excel and analyze away. Also, note that in this format, the last film, Basic emotions, returned three directors within double-quotes.
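As a sketch of consuming that csv output downstream, Python's csv module handles the quoted multi-valued directed_by cell correctly. The rows below are typed in by hand to mirror the shape of the output, and the director names in the quoted cell are placeholders, not real data:

```python
import csv
import io

# Hand-typed sample shaped like the csv Response Writer output;
# "Director A/B/C" are placeholder names, not actual directors
raw = (
    "name,directed_by\n"
    "25th Hour,Spike Lee\n"
    '"Basic emotions","Director A,Director B,Director C"\n'
)

rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["directed_by"])             # a single director
print(rows[1]["directed_by"].split(","))  # three directors in one quoted cell
```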
It is worth noting that the URL is updated with the code used to request this data from the Solr server. We could use this to train ourselves how to write more advanced queries.
Because the Solr Admin UI Query tab only scratches the surface of what can be done at the command line, let's walk through one example from there and include a parameter that allows us to remove the responseHeader from the output and select the dismax Query Parser.
The curl command is used to communicate with servers using a variety of protocols, and we will use it here to submit this search request to the Solr server directly.
Again, for a local installation use localhost instead of an IP address; the port 8983 was our default, but it could be different.
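Before running it, here is a sketch of the request we are assembling: omitHeader=true drops the responseHeader section and defType=dismax selects the dismax Query Parser (host, port and core are the defaults assumed throughout this series):

```python
from urllib.parse import urlencode

# Assumed defaults: localhost, port 8983, the films core
params = {
    "q": "Spike Lee",
    "defType": "dismax",   # select the dismax Query Parser
    "omitHeader": "true",  # drop the responseHeader from the output
    "rows": 5,
    "fl": "name,score",
}
url = "http://localhost:8983/solr/films/select?" + urlencode(params)
print(url)  # this URL can be passed to curl in quotes
```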
Assuming we entered this properly, we get 5 films minus the responseHeader in json format. Very good, and if you stick around I will explain more about relevancy scores.
So this dataset is yours to play with, and I suggest adjusting the dials and diving into the results. That is the best way to learn about the analytical process of search, and it will come in handy whether your goal is enterprise search or evaluating website search as a replacement for Google Custom Search, Amazon CloudSearch or an alternative search tool like Elasticsearch.
In the next few tutorials we will look at a new dataset, which will be unstructured and come from a website crawl, so more similar to what you would find on websites. This will require a more thorough exploration of the bin/post tool to perform the web crawl. Also, being unstructured, the data will require us to dive into field analysis and a topic we haven't discussed yet: how the index is built with analyzers, tokenizers and filters. That will be a lot of fun, so stay tuned.
With that, you should have a nice base of knowledge about search to move on and tackle more advanced topics. You now know about query output, the overall search process, how it ties in with the schema and query parameters, and how to customize search to suit your needs.
Yes, there are many aspects to creating a useful search tool and I'm here to help if you need a customized solution. So please feel comfortable reaching out to me.
Q: Is the + symbol between Spike and Lee required?
A: Its purpose is to encode the white space, which requires either a + symbol or the sequence %20.
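Python's standard library shows both encodings of the space:

```python
from urllib.parse import quote, quote_plus

print(quote_plus("Spike Lee"))  # Spike+Lee   (the + form)
print(quote("Spike Lee"))       # Spike%20Lee (the %20 form)
```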
If you are interested in topics like this, then the FactorPad YouTube Channel was made for you. Please consider joining our growing group. Subscribe here.