For many of us, search is where it's at. In 2017, scores of developers are evaluating Solr for website search as a possible replacement for Google Site Search, Google Custom Search or alternatives like Elasticsearch or Amazon CloudSearch. To make a thorough evaluation of Apache Solr search, many of us build out a test environment to learn the functionality before moving on to a development environment. This tutorial will go a long way toward helping you make a more informed choice.
In previous tutorials, we have taken steps to get a dataset loaded and ready to field queries in the Apache Solr User Interface in a browser. It is from here that we can start to see an application take shape. Also, this is where the mechanical part of getting the systems and data ready is replaced by search analytics, and is where, for many, the fun begins.
Here we will pick up where we left off in the last tutorial. From there we will walk through the process of search in Solr behind the scenes and get a better understanding of configuration files impacting results. If you have yet to get your system set up with the films dataset, I suggest going back a few tutorials to get caught up.
With that, let's get in and have some fun with Apache Solr search.
Solr Search - The Solr Query Process and How to Interpret Output (21:07)
Videos can also be accessed from our Apache Solr Search Playlist on YouTube (opens in a new browser window).
Moving on to Step 1, I want to remind you how we got here. We asked our index to return films, and our search term in the q box was Spike Lee, with the goal of finding which films he directed.
A quick word about where we left off with our films core and index. Clicking on Overview shows that it contains those 1,100 documents and provides particulars about the server Instance, Data and Index directory locations. In our case those documents relate to records in an xml file we posted to the films core in the last tutorial.
After we posted it, we clicked on Query and performed two searches. In the first, we used the default of *:* in the q box, which selects all records, so hitting Execute Query shows the output in json format. The section under common lists what are called search parameters. The first and only parameter we used was q, the box for search terms, and we gave it *:* to return all records.
Now, shift your focus to the right side of the Query tab. At the top is the request URL that, when used in a browser, will query the Solr server and bring up these results. So a website search application using this would provide results for you to render as you see fit. Clicking on this link will kick that query right to a web page if you want to try that out. The format looks like this:
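As a rough sketch, the request URL follows the pattern host:port/solr/core/select?parameters. Assuming the defaults used in this series (localhost, port 8983, a core named films), Python's standard library can assemble one:

```python
from urllib.parse import urlencode

# Assumed defaults from this series: local Solr on port 8983, a core named "films"
base = "http://localhost:8983/solr/films/select"
params = {"q": "*:*"}  # the match-all query

url = base + "?" + urlencode(params)  # urlencode percent-encodes the parameters
print(url)
```

Note that urlencode percent-encodes the special characters, which is why a URL copied from the Admin UI may look slightly different from what you typed into the q box.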
With the default settings, the output has at least two sections. First is a responseHeader, which provides basic information about the request: status, where 0 indicates no errors; QTime for the query time; and params, listing the parameters of the query.
The next section is called the response, and it provides summary information like numFound for the total of 1,100 documents. The start of 0 requests that the output begin at the first record, and rows asks for 10 of them, which is what we got rather than a dump of all 1,100. These defaults, like almost everything in Solr, can be modified.
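To make that response shape concrete, here is a minimal sketch that reads those two sections in Python. The JSON literal is hand-written to mirror the films example, illustrative only, not captured Solr output:

```python
import json

# Hand-written sample shaped like a Solr select response (illustrative only)
raw = """{
  "responseHeader": {"status": 0, "QTime": 5,
                     "params": {"q": "*:*"}},
  "response": {"numFound": 1100, "start": 0,
               "docs": [{"id": "1", "name": ".45"}]}
}"""

data = json.loads(raw)
print(data["responseHeader"]["status"])   # 0 means no errors
print(data["response"]["numFound"])       # total matches: 1100
print(len(data["response"]["docs"]))      # rows actually returned in this page
```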
Next, we have the section for the first record, a film called .45, as indicated by the field called name. Remember we imported documents with 5 fields: id for the unique id, name for the film title, directed_by for the one or several directors of the film, the initial_release_date and genre for the one or several genres the film was classified under.
And because this was a "Schemaless" configuration, Solr created three fields on its own: _version_, genre_str and directed_by_str, which are not really relevant to us now. The films data is licensed under the Creative Commons Attribution 2.5 Generic License. View the license at http://creativecommons.org/licenses/by/2.5/
In our second query, our goal was to find movies directed by Spike Lee by inputting his name in the q parameter. Clicking on Execute Query returns 11 documents this time, as identified by numFound. Does that mean he directed 11 films here? Let's see. The first one, 25th Hour, was directed by Spike Lee. The second, Bamboozled, was as well. The third, Adaptation, was directed by Spike Jonze. Interesting, so not a match, but this is search, and it identified the first name as a match on "Spike". What is going on with the fourth? It was directed by Lee Sang-il, so a match on "Lee". You get the point: Solr is finding matches, just not exact matches. This is an open-ended search, so with these parameters we shouldn't expect exact matches; we just want it to show relevant information. If we wanted an exact match, we could modify this query.
So similar to a search engine, it is returning documents near the top that are more relevant to your request, so that's good. Later we will see numeric scores on each of these films.
For Step 2, we will cover the 3-part workflow that goes on behind the scenes for every search query in Solr.
The Request Handler is a plug-in that organizes requests. The /select? in the URL points to the select Request Handler; another, /update?, provides functionality to update an index, among others. From the Solr Admin UI, the Plugins / Stats tab lists several installed by default.
The Query Parser comes next and it interprets the parameters selected and the query terms, or what you are searching for, and how the search is performed. Then it sends those requests to the index. In Solr 7 there are three widely-used alternatives.
There are about 25 other parsers available for special needs, offering flexibility to create fields on the fly, give more weight to some fields and even handle geospatial queries, used for finding the nearest coffee shop, for example. Keep in mind that different parsers require different parameters, so what you see on the Query tab are the parameters common across all Query Parsers, which we cover in Step 4.
After the Query Parser submits requests to the index, additional transformations occur before results are returned. These are performed by the Response Writer. Examples include how much data to include, which additional filters or groupings to apply and, finally, which format to present the data in.
From the Query tab, if you click on the dropdown marked wt you will see the 6 most common Response Writers of the 21 available. This customizes the output for the data's next destination. For example, if you will perform additional processing in Python, the output can be customized for Python. The default is json.
For Step 3, let's look at the schema created automatically by Solr. If you recall, we didn't spend any time analyzing fields; instead we jumped right in to our first query. So let's do that analysis here.
A schema is an xml file that tells Solr how to ingest documents into the core, process them into fields and spit out an index we hope is usable for our audience. In our films case, with a "schemaless" configuration, Solr automatically interpreted rules for field types, meaning text or numeric, by default. It also has rules for how to process punctuation, capitalized words and web addresses. Solr will also create fields on the fly that combine other fields together to aid in search, using what is called a Copy Field, and we added one in the last tutorial.
Before we look at the schema, we should bring back two points. First, the xml schema files can be managed by hand, or with the help of two Solr tools: the Solr Admin UI or the Solr Schema API from the command line.
Second, recall from the last tutorial that the method for editing the schema dictates the name of the schema file. If it is hand-edited, it is schema.xml. If it is managed using the tools, to prevent us from making mistakes, it is called managed-schema. The "schemaless" configuration dictated that we use the latter, so let's look at that. The managed-schema file is located in the conf/ directory within the home of the core.
Let's look at the installation directory first, using a pwd followed by a listing of its contents. I usually keep this installation directory as my working directory so it is easy to access the script straight from here. If this looks confusing, don't worry; you will quickly memorize all of the file locations. Also, within the server directory sits the films core and all of its settings and data.
The managed-schema file is quite long, about 500 lines. So instead of confusing ourselves by opening the whole thing up and searching for the fields, I find it easier to use the Solr Schema API from the command line to pull out just the parts we need. Let's try the first one and focus on the fields Solr created when we posted documents to it.
Okay, so this is the list of fields in json format; let's make a few observations. First, the output here also has a responseHeader, and the second section shows the fields. What we are interested in at this point are the five imported fields, all at the bottom of the output.
|directed_by||directed_by was given a field type of "text_general" which seems logical.|
|genre||genre is also a "text_general" field.|
|id||the id field has more going on. It was assigned a "string" field type, and it is not "multiValued", meaning each record has a unique id. It is "indexed", "required" and "stored". So it has all of the characteristics of a unique identification number.|
|initial_release_date||the initial_release_date field was assigned the type "pdates" for a date field.|
|name||the "name" field, if you recall from the last tutorial, is the one we assigned to "text_general" when we edited the schema using the Solr Admin UI. We did this because the name of the first film, .45, would otherwise have led Solr to guess a numeric field type.|
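On a default local install, these field definitions can be pulled from the Schema API with a URL like http://localhost:8983/solr/films/schema/fields (host and port are the assumed defaults from this series). Here is a small Python sketch of reading that output; the JSON literal is hand-written to mirror the table above, not captured from a live server:

```python
import json

# Hand-written sample shaped like the Schema API's fields output;
# values mirror the table above and are illustrative only
raw = """{
  "responseHeader": {"status": 0},
  "fields": [
    {"name": "directed_by", "type": "text_general"},
    {"name": "genre", "type": "text_general"},
    {"name": "id", "type": "string", "multiValued": false,
     "indexed": true, "required": true, "stored": true},
    {"name": "initial_release_date", "type": "pdates"},
    {"name": "name", "type": "text_general"}
  ]
}"""

# Index the field definitions by name for easy lookup
fields = {f["name"]: f for f in json.loads(raw)["fields"]}
print(fields["id"]["type"])          # string
print(fields["id"]["multiValued"])   # False: each record has a unique id
```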
In our second modification we set up a Copy Field. We can use the Schema API to pull these as well.
So the field "_text_" was the one we created, and the other two Solr created on its own. This explains the extra fields we saw in the output earlier.
The point here was to circle back and see what Solr did behind the scenes. We mentioned that the "schemaless" configuration is not meant for production, but helps to get started quickly. For production, we would want to fine-tune each of the fields and field types. This topic will take on more meaning down the road when we use unstructured data, like you might see in a more traditional website search application.
Now for Step 4, let's walk through the default query parameters common across Query Parsers. For that, we will keep it simple by using the Solr Admin UI.
|defType||This selects the query parser, with the three most common being 1) lucene, 2) dismax and 3) edismax.||lucene|
|q||This is where you enter search terms.||*:*|
|fq||This is used to set a filter query which creates a cache of potential sub-queries. So if your user will look at more granular information it helps with speed to identify the fields ahead of time so the results are ready in cache.||none|
|sort||Here you enter how you would like results sorted, with common options being asc or desc for ascending and descending.||desc|
|start,rows||The start,rows parameters are used like a search in Google that provides the top 10 results by default and allows you to resubmit the query to find the next 10 results. So think of it as a way to paginate query results. The 0 starts at the first record and 10 shows 10 records, or rows.||0,10|
|fl||The field list parameter limits the results to a specific list of fields. In order for them to show up the schema must have one of two settings for the field: 1) stored="true" or 2) docValues="true". Multiple fields can be selected and are separated by commas or spaces. You can also return the score given as a measure of the relevance of the search results. A * shows all stored fields.||*|
|df||In the default field parameter you could enter a specific field you would like Solr to search. In our case, the default search field is the Copy Field we created called _text_.||_text_|
|Raw Query Parameters||The Raw Query Parameters section is for advanced use, like for query debugging.||none|
|wt||The wt parameter selects one of 6 popular Response Writers with output customized for 1) json, 2) xml, 3) python, 4) ruby, 5) php, 6) csv. There are another 16 that can be input from the command line.||json|
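Putting several of these parameters together, here is a sketch of the kind of request URL the Query tab assembles behind the scenes (host, port and core are the defaults assumed throughout this series):

```python
from urllib.parse import urlencode

# Common query parameters, mirroring the table above
params = {
    "q": "Spike Lee",      # search terms
    "rows": 5,             # page size
    "start": 0,            # offset, for pagination
    "fl": "name,score",    # field list, plus the relevancy score
    "sort": "score desc",  # highest relevancy first
    "wt": "json",          # Response Writer
}
url = "http://localhost:8983/solr/films/select?" + urlencode(params)
print(url)
```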
The checkboxes are fairly self-explanatory, with indent and debugQuery referring to the visual format and items returned from the query. You can also select the dismax and edismax Query Parsers here. In addition, there are checkboxes for specific functions: highlighting (hl) search text, faceting (facet) to group results in buckets, geospatial (spatial) results for navigational search and spellchecking (spellcheck) capabilities.
The best way to learn this is with practice, of course.
In Step 5, we will modify search parameters in the Solr Admin UI, and after that we will try one straight from the command line.
For our first example, let's type Spike Lee in the q parameter again. We want only the first 5 results, so type 5 in the rows parameter box. By default they will be returned in descending order based on a relevancy score for the search term. This time let's view the film name and the score by entering name,score in the fl parameter. Clicking on the Execute Query button presents these in the default json format.
So to summarize, we see the same number of documents found as before, 11. This time we have a maxScore of 11.271416. We will cover document relevancy and scoring in future tutorials, but it is helpful to see by how far the two films Spike Lee actually directed are separated from those in which only "Spike" or "Lee" appeared in the directed_by field in our original dataset.
In our second example, we will add the directed_by field to the fl parameter and use the wt dropdown to select the csv output.
This output is nice for those who are comfortable with a spreadsheet-type format, like those of us with finance and statistics backgrounds, so you could easily kick data like this into Excel and analyze away. Also, note that in this format, the last film, Basic emotions, returned three directors within double-quotes.
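As a sketch of consuming that csv output downstream, Python's csv module handles the quoted multi-valued directed_by cell correctly. The rows below are typed in by hand to mirror the shape of the output, and the director names in the quoted cell are placeholders, not real data:

```python
import csv
import io

# Hand-typed sample shaped like the csv Response Writer output;
# "Director A/B/C" are placeholder names, not actual directors
raw = (
    "name,directed_by\n"
    "25th Hour,Spike Lee\n"
    '"Basic emotions","Director A,Director B,Director C"\n'
)

rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["directed_by"])             # a single director
print(rows[1]["directed_by"].split(","))  # three directors in one quoted cell
```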
It is worth noting that the URL is updated with the code used to request this data from the Solr server. We could use this to train ourselves how to write more advanced queries.
Because the Solr Admin UI Query tab only scratches the surface of what can be done at the command line, let's walk through one example from there and include a parameter that allows us to remove the responseHeader from the output and select the dismax Query Parser.
The curl command is used to communicate with servers using a variety of protocols, and we will use it here to submit this search request to the Solr server directly.
Again, for a local installation use localhost instead of an IP address; the port 8983 was our default, but it could be different.
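Before running it, here is a sketch of the request we are assembling: omitHeader=true drops the responseHeader section and defType=dismax selects the dismax Query Parser (host, port and core are the defaults assumed throughout this series):

```python
from urllib.parse import urlencode

# Assumed defaults: localhost, port 8983, the films core
params = {
    "q": "Spike Lee",
    "defType": "dismax",   # select the dismax Query Parser
    "omitHeader": "true",  # drop the responseHeader from the output
    "rows": 5,
    "fl": "name,score",
}
url = "http://localhost:8983/solr/films/select?" + urlencode(params)
print(url)  # this URL can be passed to curl in quotes
```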
Assuming we entered this properly, we get 5 films minus the responseHeader in json format. Very good, and if you stick around I will explain more about relevancy scores.
So this dataset is yours to play with, and I suggest adjusting the dials and diving into the results. That is the best way to learn about the analytical process of search, and it will come in handy whether your goal is enterprise search or evaluating website search as a replacement for Google Custom Search, Amazon CloudSearch or an alternative search tool like Elasticsearch.
In the next few tutorials we will look at a new dataset, which will be unstructured and come from a website crawl, so more similar to what you would find on websites. This will require a more thorough exploration of the bin/post tool to perform the web crawl. Also, being unstructured, the data will require us to dive into field analysis and a topic we haven't discussed yet: how the index is built with analyzers, tokenizers and filters. That will be a lot of fun, so stay tuned.
With that, you should have a nice base of knowledge about search to move on and tackle more advanced topics. You now know about query output, the overall search process, how it ties in with the schema and query parameters, and how to customize search to suit your needs.
Yes, there are many aspects to creating a useful search tool and I'm here to help if you need a customized solution. So please feel comfortable reaching out to me.
Q: Is the + symbol between Spike and Lee required?
A: Its purpose is to encode the white space, which requires either a + symbol or the sequence %20.
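Python's standard library shows both encodings of the space:

```python
from urllib.parse import quote, quote_plus

print(quote_plus("Spike Lee"))  # Spike+Lee   (the + form)
print(quote("Spike Lee"))       # Spike%20Lee (the %20 form)
```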
If you are interested in topics like this, then the FactorPad YouTube Channel was made for you. Please consider joining our growing group. Subscribe here.