/ factorpad.com / tech / solr / tutorial / solr-basics.html
Beginner
In the initial phases of our Solr Tutorial we will be exploring its functionality in a test environment before moving on to a production environment, as advised by Apache. The first step is to get acquainted with the basics of a running Solr instance.
I have Solr installed on a Debian 8 Jessie server on my end and I am using an SSH connection into that server. So while you may have this set up on a Linux, macOS or Windows client machine, a few instructions will be different from my scenario. Where they are different, I will let you know.
Videos can also be accessed from our Apache Solr Search Playlist on YouTube.
Solr Basics - solr script, Solr Admin, directories and examples (13:10)
In the last tutorial we walked through five steps to an Apache Solr 7 installation including where to find help, how to download appropriate files and how to verify and install the files. We also took a minute to start and stop the service on the server.
First, we will explore the directory where we installed our test environment. The default installation created the working directory `solr-7.0.0`. The first thing I do after an installation is to get acquainted with the directory structure. And to save you time, I will show you what I think is important for beginners by skipping some of the minutiae, starting with a list of directory contents using the `ls -og` command.
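As a minimal sketch of that first look, assuming Solr was unpacked into a `solr-7.0.0` directory below the current one, the listing can be wrapped in a small helper. The function name `list_solr_dir` and the `SOLR_DIR` variable are my own placeholders and not part of Solr. With GNU `ls`, `-o` drops the group column and `-g` drops the owner column from the long listing; other platforms may treat these flags differently.

```shell
#!/bin/sh
# Hypothetical helper: show a long listing of a Solr install directory.
# With GNU ls, -o omits the group column and -g omits the owner column.
list_solr_dir() {
    dir="${1:-solr-7.0.0}"    # assumed install directory name
    if [ -d "$dir" ]; then
        ls -og "$dir"
    else
        echo "directory not found: $dir" >&2
        return 1
    fi
}

# Usage (SOLR_DIR is an assumed variable; falls back to ./solr-7.0.0).
list_solr_dir "${SOLR_DIR:-solr-7.0.0}" || echo "set SOLR_DIR to your install path"
```

If the directory is missing, the helper prints a note instead of failing silently, which is handy when you are not sure where the archive was unpacked.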
First, let's summarize the directories, highlighting only the most important ones for now, and in step 3 we will explore the `bin` directory in greater detail.
Directory | Description |
---|---|
bin | Command line scripts for administering Solr instances, posting documents and creating indexes |
contrib | Plugins for additional features |
dist | Solr Java Archive Files or .jar files |
docs | An HTML document with links to documentation on the Apache website |
example | Datasets used with the Apache Solr Tutorial |
licenses | Licenses of third-party libraries |
server | Programs, log files, configuration files and server scripts |
Next, there are five text files with fairly obvious meanings based on their titles. At this stage I wouldn't spend time reading them unless you are curious and have time.
Text File | Description |
---|---|
CHANGES.txt | An overview, summary of upgrades, new and deprecated features, configurations and bug fixes |
LICENSE.txt | The Apache License, Version 2.0, January 2004 |
LUCENE_CHANGES.txt | A summary of changes to the related program Apache Lucene |
NOTICE.txt | References to other third-party copyrighted code |
README.txt | Descriptions of command line scripts, examples, indexing, files, directories and how to build Apache Solr from source code |
For step 2, we will review the directory associated with the official Solr Tutorial provided by Apache. A link can be found by clicking Resources on the Apache Solr website.
You may ask, if Apache has a tutorial then why did I create one? While I encourage you to review their tutorial, to me it is too difficult for beginners. Their first exercise starts in SolrCloud mode, where you build a small cluster of two nodes on two ports, split the data into shards across those nodes and keep replicas for failover. So if you are ready to tackle clusters, shards, nodes and replicas right from the start then I encourage you to review the official Apache Tutorial.
This FactorPad tutorial, on the other hand, is designed for absolute beginners who are interested in evaluating whether Solr is a workable solution for their website search and enterprise search needs. Here we move a little slower, preferring to define each term as we go. That said, because we all now have the Apache data set, and it offers a good playground to learn more advanced features, let's have a look at it.
These files reside in the `example` directory.
Next, list the `exampledocs` directory to see the files associated with Exercise 1 of the Apache Tutorial.
Here is the set of small files used in the Apache example called Techproducts. As you can see, Solr can ingest many types of files and this is only a subset of the many file extensions Solr can work with.
- `.csv` - Comma Separated Values
- `.json` - JavaScript Object Notation
- `.xml` - eXtensible Markup Language
- `.jsonl` - for line-delimited JSON files
- `.html` - Hypertext Markup Language
- `.pdf` - Adobe Portable Document Format
Most of the files here are in `.xml` format. I suggest opening the file `mem.xml` on your end to see an example.
With Techproducts in the first exercise, Apache is using structured data, or data you might find in a database, with fields and values. So imagine a technology retailer has dumped a list of their products into this directory in `.xml` format and wants Solr to create a searchable index.
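To make fields and values concrete, here is a minimal sketch of a record in Solr's XML update format, where `<add>` wraps one or more `<doc>` elements and each `<field>` carries a name and a value. The product is invented for illustration and is not copied from `mem.xml`; the helper names `write_sample_doc` and `list_field_names` are likewise my own.

```shell
#!/bin/sh
# Write a made-up product record in Solr's XML update format.
# <add> wraps one or more <doc> elements; each <field> is a name/value pair.
write_sample_doc() {
    cat > "$1" <<'EOF'
<add>
  <doc>
    <field name="id">EXAMPLE-001</field>
    <field name="name">Example 1GB Memory Module</field>
    <field name="price">49.99</field>
    <field name="inStock">true</field>
  </doc>
</add>
EOF
}

# Print the field names found in an XML update file, one per line.
list_field_names() {
    sed -n 's/.*<field name="\([^"]*\)".*/\1/p' "$1"
}

sample=$(mktemp)
write_sample_doc "$sample"
list_field_names "$sample"    # prints id, name, price and inStock, one per line
rm -f "$sample"
```

Reading the field names this way is a quick check of what a file will contribute to the index before you hand it to Solr.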
That is an example of structured data. A use case for unstructured data, on the other hand, would be to ingest web pages and create a searchable index much like Google Search does. We will do both here.
The examples in the `films` directory are used in Exercise 2 and the data sets here are much larger.
The second Exercise involves one of my favorite topics, movies.
Scanning the `films.xml` file you can see another example of structured data consisting of fields and values. In this case, Apache walks through an example using facets, which allow you to drill down into data in a search application.
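As a hedged illustration of a facet request, the sketch below only builds the query URL. It assumes a core named `films` on `localhost:8983` and a facet field called `genre_str`; those names, like the `facet_url` helper, are assumptions for illustration, so substitute whatever your own index uses. The `facet=true` and `facet.field` parameters ask Solr to return value counts for the field, and `rows=0` suppresses the matching documents.

```shell
#!/bin/sh
# Build a select URL that asks Solr for facet counts on one field.
# Host, core and field are caller-supplied; nothing is sent to a server here.
facet_url() {
    host="$1"; core="$2"; field="$3"
    printf 'http://%s:8983/solr/%s/select?q=*:*&rows=0&facet=true&facet.field=%s\n' \
        "$host" "$core" "$field"
}

# Assumed names for illustration: core "films", field "genre_str".
facet_url localhost films genre_str
# With a running server and an indexed core you could fetch it, for example:
#   curl "$(facet_url localhost films genre_str)"
```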
You may be asking yourself, why don't we just skip Apache Solr altogether and search the database itself? This is a good and logical question, and for an answer stick around for the next tutorial.
Now, for Step 3, let's move up one directory and explore the `bin` directory mentioned earlier, as it includes Solr command line scripts used to create and name an example data set, put or post documents into it and administer Solr indexes.
Going from top to bottom, `init.d`, `install_solr_service.sh` and `oom_solr.sh` all relate to installing Solr on a production server, so we need not concern ourselves with those just yet.
The `post` script is used to put or post one or a group of documents into an index.
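Here is a minimal sketch of posting documents, assuming you are in the Solr home directory and that a core already exists to receive them. The core name `techproducts` in the usage comment and the `post_docs` wrapper are assumptions for illustration; the `-c` option names the target core or collection.

```shell
#!/bin/sh
# Hypothetical wrapper around bin/post: first argument is the core name,
# remaining arguments are the files to send.
post_docs() {
    core="$1"; shift
    if [ -x bin/post ]; then
        bin/post -c "$core" "$@"
    else
        echo "bin/post not found; change into the Solr home directory first" >&2
        return 1
    fi
}

# Usage, from the Solr home directory with the server running:
#   post_docs techproducts example/exampledocs/*.xml
```

Because the shell expands `*.xml` before the script runs, one call can send a whole group of files.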
The `solr` script for macOS and Linux machines, and `solr.cmd` for Windows, is used to start, stop and administer the Solr server from the command line. We will use this script later in this tutorial.
The `solr.in.cmd` and `solr.in.sh` files are used to set server properties.
So at the beginning stages we need only concern ourselves with two scripts or commands, `post` and `solr`.
Now that you have an idea about the `solr` command, let's explore how we can use it, starting with how to find help.
In most of the online Solr documentation you will see the command executed by pointing to it from the Solr home directory, so make sure you are sitting in the `solr-7.0.0` home directory.
For those of us using a Linux or macOS machine the logical first place to start is with a `man` page.
That doesn't work for `solr`, but what is helpful is the `bin/solr -help` command.
The first section shows usage information and lists 12 of what I'll call sub-commands, including `start`, `stop`, `status` and `delete`. This is how you administer your Solr server from the command line and we will explore each of these throughout the tutorial series.
To find help on any of these sub-commands, type `-help` after the sub-command. So to get help on the `start` sub-command you would input `bin/solr start -help` to see the syntax and all of the options for that sub-command.
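The help lookups above can be sketched as a small loop. It assumes the current directory is the Solr home directory and checks for `bin/solr` before running anything; the `in_solr_home` and `show_help` names are hypothetical helpers, not Solr commands.

```shell
#!/bin/sh
# Return success when the current directory looks like a Solr home,
# judged here only by the presence of an executable bin/solr script.
in_solr_home() {
    [ -x bin/solr ]
}

# Show help for each named sub-command, or general usage when no
# sub-command is given.
show_help() {
    if ! in_solr_home; then
        echo "bin/solr not found; change into the Solr home directory first" >&2
        return 1
    fi
    if [ "$#" -eq 0 ]; then
        bin/solr -help
        return
    fi
    for sub in "$@"; do
        bin/solr "$sub" -help
    done
}

# Usage, from the Solr home directory:
#   show_help                 # general usage
#   show_help start stop      # help for two sub-commands
```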
Okay, now that you know how to find help, the next step is to start the Solr server using `bin/solr start`.
If Solr reports that the server started, it is running on port 8983 and waiting for your next request.
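If you would rather confirm from the command line that the server is answering, a minimal sketch is to poll it with `curl`. This assumes `curl` is installed and that Solr is on `localhost`; the `wait_for_solr` helper and its retry count are my own choices, and the `/solr/admin/info/system` path is the system information endpoint as I understand it, so treat it as an assumption to verify against your version.

```shell
#!/bin/sh
# Hypothetical helper: poll Solr until it answers or the tries run out.
# Arguments: port (default 8983) and number of tries (default 10).
# It sleeps one second after each failed attempt.
wait_for_solr() {
    port="${1:-8983}"
    tries="${2:-10}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -s silences progress, -f treats HTTP errors as failures,
        # -o /dev/null discards the response body.
        if curl -s -f -o /dev/null --max-time 2 \
            "http://localhost:$port/solr/admin/info/system"; then
            echo "Solr is answering on port $port"
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "no response from Solr on port $port" >&2
    return 1
}

# Usage, from the Solr home directory:
#   bin/solr start && wait_for_solr 8983 10
```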
Now that the Solr server is up and running let's look at it using the Solr Admin User Interface, a web-based tool that can be used to perform system administration and to enter search queries.
If you installed Solr on a client machine this can be found by entering `http://localhost:8983/solr/` in your browser address bar.
If you installed Solr on a headless server, like I did, you can enter the hostname for the server, or in my case use its IP address, a colon, then the assigned and default port number 8983 followed by the solr directory.
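The address is just host, colon, port and the `solr` path, which the sketch below assembles. The `admin_url` helper is hypothetical and `192.0.2.10` is a documentation-only placeholder address, not my server's.

```shell
#!/bin/sh
# Assemble the Solr Admin UI address from a host and an optional port.
admin_url() {
    host="$1"
    port="${2:-8983}"    # 8983 is the default port mentioned above
    printf 'http://%s:%s/solr/\n' "$host" "$port"
}

admin_url localhost            # prints http://localhost:8983/solr/
admin_url 192.0.2.10 8983      # placeholder IP for a headless server
```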
When we create a dataset this will be more interesting and there is a lot here that is beyond our current knowledge level, but what I would like to mention are two terms people often use synonymously: core and collection.
A core is used in the context of a standalone mode, or non-distributed search using one server, which we have here, whereas a collection is used in SolrCloud mode, or distributed search where the index is spread across several computers to handle heavy search traffic.
At this point, since we will begin in standalone mode, we will focus on the term core and refer to it as a set of documents with its own name, directory, data, configurations and schema that all come together to create a searchable index.
Under Core Admin from within the Solr Admin UI you can create one. Doing so will populate default names for each of five fields. But since we haven't explored these yet, it is a good time to stop until we do.
Before we close out, let's head back to the command line and stop the Solr server instance using `bin/solr stop`.
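The start, check and stop cycle from this tutorial can be sketched as one wrapper. It assumes the current directory is the Solr home directory and a single local instance; `solr_ctl` is a hypothetical name, and it deliberately accepts only the three sub-commands used here.

```shell
#!/bin/sh
# Hypothetical wrapper: run one of the lifecycle sub-commands covered
# in this tutorial and reject anything else.
solr_ctl() {
    action="$1"
    case "$action" in
        start|stop|status) ;;
        *) echo "usage: solr_ctl start|stop|status" >&2; return 2 ;;
    esac
    if [ ! -x bin/solr ]; then
        echo "bin/solr not found; change into the Solr home directory first" >&2
        return 1
    fi
    bin/solr "$action"
}

# Usage, from the Solr home directory:
#   solr_ctl start
#   solr_ctl status
#   solr_ctl stop
```

Restricting the accepted actions keeps a typo from reaching the real script, which matters more once sub-commands like `delete` are in play.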
Very good. So there is a review of Solr basics. In the next tutorial we will cover the term inverted index leading to a discussion of configuration files including schema.
Of course working with Apache Solr can be challenging, so if you need any customized help for your particular situation please feel free to reach out on social media, including at our FactorPad YouTube Channel.
Q: Which type of data is easier to start with in Solr, structured or unstructured?
A: I think unstructured data is easier to learn because it behaves more like a search engine that we are all familiar with (like Google, Bing or Yandex). Structured data, like that stored in a database, is less intuitive and more difficult for beginners. For this reason, we will start by indexing unstructured documents.