Apache Solr Basics - Solr Script, Solr Admin, Directories and Examples

Explore the Basics of File Structures and Solr Server Administration

Beginner

In the initial phases of our Solr Tutorial we will be exploring its functionality in a test environment before moving on to a production environment, as advised by Apache. The first step is to get acquainted with the basics of a running Solr instance.

I have Solr installed on a Debian 8 Jessie server on my end and I am using an SSH connection into that server. So while you may have this set up on a Linux, macOS or Windows client machine, a few instructions will be different from my scenario. Where they are different, I will let you know.

Apache Solr in Video

Videos can also be accessed from our Apache Solr Search Playlist on YouTube.

Solr Basics - solr script, Solr Admin, directories and examples (13:10)

For Those Just Starting Out

Step 1 - Explore the Directory Layout

In the last tutorial we walked through five steps to an Apache Solr 7 installation including where to find help, how to download appropriate files and how to verify and install the files. We also took a minute to start and stop the service on the server.

What is in the solr-7.0.0 main directory?

First, we will explore the directory where we installed our test environment. The default installation created the working directory solr-7.0.0. The first thing I do after an installation is to get acquainted with the directory structure. And to save you time, I will show you what I think is important for beginners by skipping some of the minutiae, starting with a list of directory contents using the ls -og command.

    $ cd ~/solr-7.0.0
    $ ls -og
    total 1460
    drwxr-xr-x  3   4096 Oct  1 23:06 bin
    -rw-r--r--  1 722808 Sep  8 12:36 CHANGES.txt
    drwxr-xr-x 11   4096 Sep  8 13:21 contrib
    drwxr-xr-x  4   4096 Oct  1 11:22 dist
    drwxr-xr-x  3   4096 Oct  1 11:22 docs
    drwxr-xr-x  7   4096 Oct  1 11:22 example
    drwxr-xr-x  2  32768 Oct  1 11:22 licenses
    -rw-r--r--  1  12646 Sep  8 12:34 LICENSE.txt
    -rw-r--r--  1 655812 Sep  8 12:36 LUCENE_CHANGES.txt
    -rw-r--r--  1  24831 Sep  8 12:34 NOTICE.txt
    -rw-r--r--  1   7271 Sep  8 12:34 README.txt
    drwxr-xr-x 11   4096 Oct  1 11:55 server

Review the directories

First, let's summarize the directories, highlighting only the most important ones for now. In Step 3 we will explore the bin directory in greater detail.

Directory	Description
`bin`	Command line scripts for administering Solr instances, posting documents and creating indexes
`contrib`	Plugins for additional features
`dist`	Solr Java Archive Files or .jar files
`docs`	An HTML document with links to documentation on the Apache website
`example`	Datasets used with the Apache Solr Tutorial
`licenses`	Licenses of third-party libraries
`server`	Programs, log files, configuration files and server scripts

Review the text files

Next, there are five text files with fairly obvious meanings based on their titles. At this stage I wouldn't spend time reading them unless you are curious and have time.

Text File	Description
`CHANGES.txt`	An overview, summary of upgrades, new and deprecated features, configurations and bug fixes
`LICENSE.txt`	The Apache License, Version 2.0, January 2004
`LUCENE_CHANGES.txt`	A summary of changes to the related program Apache Lucene
`NOTICE.txt`	References to other third-party copyrighted code
`README.txt`	Descriptions of command line scripts, examples, indexing, files, directories and how to build Apache Solr from source code

Step 2 - Run Through the Apache-Provided Example Data Sets

For step 2, we will review the directory associated with the official Solr Tutorial provided by Apache. A link can be found by clicking Resources on the Apache Solr website.

You may ask, if Apache has a tutorial then why did I create one? While I encourage you to review their tutorial, to me it is too difficult for beginners. Their first exercise starts in SolrCloud mode, where you build a small cluster of two servers on two ports, split the data into shards across the two nodes and keep replicas for use in a failover situation. So if you are ready to tackle clusters, shards, nodes and replicas right from the start, then I encourage you to review the official Apache Tutorial.

This FactorPad tutorial, on the other hand, is designed for absolute beginners who are interested in evaluating whether Solr is a workable solution for their website search and enterprise search needs. Here we move a little slower, preferring to define each term as we go. That said, because we all now have the Apache data set, and it offers a good playground to learn more advanced features, let's have a look at it.

These files reside in the example directory.

    $ cd example
    $ ls
    example-DIH  exampledocs  files  films  README.txt  resources

Next, list the exampledocs directory to see the files associated with Exercise 1 of the Apache Tutorial.

    $ ls -og exampledocs
    total 128
    -rw-r--r-- 1   959 Sep  8 12:34 books.csv
    -rw-r--r-- 1  1148 Sep  8 12:34 books.json
    -rw-r--r-- 1  1333 Sep  8 12:34 gb18030-example.xml
    -rw-r--r-- 1  2245 Sep  8 12:34 hd.xml
    -rw-r--r-- 1  2074 Sep  8 12:34 ipod_other.xml
    -rw-r--r-- 1  2109 Sep  8 12:34 ipod_video.xml
    -rw-r--r-- 1  2801 Sep  8 12:34 manufacturers.xml
    -rw-r--r-- 1  3090 Sep  8 12:34 mem.xml
    -rw-r--r-- 1  2156 Sep  8 12:34 money.xml
    -rw-r--r-- 1  1402 Sep  8 12:34 monitor2.xml
    -rw-r--r-- 1  1420 Sep  8 12:34 monitor.xml
    -rw-r--r-- 1   178 Sep  8 12:34 more_books.jsonl
    -rw-r--r-- 1  1976 Sep  8 12:34 mp500.xml
    -rw-r--r-- 1 27146 Sep  8 13:21 post.jar
    -rw-r--r-- 1   235 Sep  8 12:34 sample.html
    -rw-r--r-- 1  1684 Sep  8 12:34 sd500.xml
    -rw-r--r-- 1 21052 Sep  8 12:34 solr-word.pdf
    -rw-r--r-- 1  1810 Sep  8 12:34 solr.xml
    -rwxr-xr-x 1  3742 Sep  8 12:34 test_utf8.sh
    -rw-r--r-- 1  1835 Sep  8 12:34 utf8-example.xml
    -rw-r--r-- 1  2697 Sep  8 12:34 vidcard.xml

Here is the set of small files used in the Apache example called Techproducts. As you can see, Solr can ingest many types of files and this is only a subset of the many file extensions Solr can work with.

.csv - Comma Separated Values
.json - JavaScript Object Notation
.xml - eXtensible Markup Language
.jsonl - for line-delimited JSON files
.html - Hypertext Markup Language
.pdf - Adobe Portable Document Format
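
To make the list above concrete, here is a minimal sketch that maps a file name to its format description. The describe_format function is my own hypothetical helper, not something that ships with Solr, and it only looks at the file extension.

```shell
# Hypothetical helper (not part of Solr): describe a file by its extension,
# using the descriptions from the list above.
describe_format() {
  case "$1" in
    *.csv)   echo "Comma Separated Values" ;;
    *.jsonl) echo "line-delimited JSON" ;;
    *.json)  echo "JavaScript Object Notation" ;;
    *.xml)   echo "eXtensible Markup Language" ;;
    *.html)  echo "Hypertext Markup Language" ;;
    *.pdf)   echo "Adobe Portable Document Format" ;;
    *)       echo "other" ;;
  esac
}

# Try it on a few file names taken from the exampledocs listing.
for f in books.csv books.json more_books.jsonl mem.xml sample.html solr-word.pdf post.jar; do
  printf '%s: %s\n' "$f" "$(describe_format "$f")"
done
```

Names that match none of the six extensions, such as post.jar, fall through to the final "other" branch.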

Most of the files here are in .xml format. I suggest opening the file mem.xml on your end to see an example.

With Techproducts in the first exercise, Apache is using structured data, or data you might find in a database, with fields and values. So imagine a technology retailer has dumped a list of their products into this directory in .xml format and wants Solr to create a searchable index.

That is an example of structured data. A use case for unstructured data, on the other hand, would be to ingest web pages and create a searchable index much like Google Search does. We will do both here.

The examples in the films directory are used in Exercise 2 and the data sets here are much larger.

    $ ls -og films
    total 884
    -rw-r--r-- 1   3829 Sep  8 12:34 film_data_generator.py
    -rw-r--r-- 1 124581 Sep  8 12:34 films.csv
    -rw-r--r-- 1 300955 Sep  8 12:34 films.json
    -rw-r--r-- 1    299 Sep  8 12:34 films-LICENSE.txt
    -rw-r--r-- 1 455444 Sep  8 12:34 films.xml
    -rw-r--r-- 1   4986 Sep  8 12:34 README.txt

The second exercise involves one of my favorite topics, movies. Scanning the films.xml file, you can see another example of structured data consisting of fields and values. In this case, Apache walks through an example using facets, which allow you to drill down into data in a search application.

You may be asking yourself: why don't we just skip Apache Solr altogether and search the database itself? This is a good and logical question, and for an answer stick around for the next tutorial.

Step 3 - Review Scripts in the bin Directory

Now, for Step 3, let's move up one directory and explore the bin directory mentioned earlier as it includes Solr command line scripts used to create and name an example data set, put or post documents into it and administer Solr indexes.

    $ cd ..
    $ ls -og bin
    total 196
    drwxr-xr-x 2  4096 Sep  8 12:34 init.d
    -rwxr-xr-x 1 12694 Sep  8 12:34 install_solr_service.sh
    -rwxr-xr-x 1  1255 Sep  8 12:34 oom_solr.sh
    -rwxr-xr-x 1  8209 Sep  8 12:34 post
    -rwxr-xr-x 1 74749 Sep  8 12:36 solr
    -rwxr-xr-x 1 68007 Sep  8 12:36 solr.cmd
    -rwxr-xr-x 1  6831 Sep  8 12:34 solr.in.cmd
    -rwxr-xr-x 1  7314 Sep  8 12:34 solr.in.sh

Going from top to bottom, init.d, install_solr_service.sh and oom_solr.sh all relate to installing Solr on a production server, so we need not concern ourselves with those just yet.

The post script is used to put or post one or a group of documents into an index.
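
As a rough sketch of how the post script is called, the helper below prints a command line rather than running it, so it is safe to try with the server stopped. The post_command function is hypothetical, and the core name techproducts is an assumption borrowed from the Apache example. The -c option names the core or collection that receives the documents.

```shell
# Hypothetical helper: build (but do not run) a bin/post command line.
# $1 is the core or collection name; the remaining arguments are files to post.
post_command() {
  core=$1
  shift
  echo "bin/post -c $core $*"
}

# "techproducts" is an assumed core name; the file paths come from the
# exampledocs listing and are relative to the solr-7.0.0 directory.
post_command techproducts example/exampledocs/mem.xml example/exampledocs/books.csv
```

Once a core exists and the server is running, the printed line can be pasted into the terminal from the solr-7.0.0 directory.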

The solr script, for macOS and Linux machines, and solr.cmd, for Windows, are used to start, stop and administer the Solr server from the command line. We will use this script later in this tutorial.

The solr.in.cmd and solr.in.sh files are used to set server properties.

So at the beginning stages we need only concern ourselves with two scripts or commands, post and solr.

Step 4 - Start the Solr Server and Explore Help

Now that you have an idea about the solr command, let's explore how we can use it, starting with how to find help.

Finding help on the bin/solr command

In most of the online Solr documentation you will see the command executed by pointing to it from the top-level Solr installation directory, so make sure you are sitting in the solr-7.0.0 directory.

$ cd ~/solr-7.0.0

For those of us using a Linux or macOS machine the logical first place to start is with a man page.

    $ man solr
    No manual entry for solr

We can see this doesn't work, but what is helpful is the bin/solr -help command.

    $ bin/solr -help

    Usage: solr COMMAND OPTIONS
           where COMMAND is one of: start, stop, restart, status, healthcheck, create, create_core, create_collection, delete, version, zk, auth

      Standalone server example (start Solr running in the background on port 8984):

        ./solr start -p 8984

      SolrCloud example (start Solr running in SolrCloud mode using localhost:2181 to connect to Zookeeper, with 1g max heap size and remote Java debug options enabled):

        ./solr start -c -m 1g -z localhost:2181 -a "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1044"

    Pass -help after any COMMAND to see command-specific usage information,
      such as:    ./solr start -help or ./solr stop -help

The first section shows usage information and lists 12 of what I will call sub-commands, including start, stop, status and delete. This is how you administer your Solr server from the command line, and we will explore each of these throughout the tutorial series.

To find help on any of these sub-commands, type -help after the sub-command. For example, for help on the start sub-command you would enter bin/solr start -help to see the syntax and all of the options for that sub-command.
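
As a small sketch of that pattern, the loop below prints the help invocation for every sub-command listed in the bin/solr -help output. It only echoes the commands, so nothing is run against Solr. I am assuming each sub-command accepts -help, as the help text states; a simple one such as version may just print its normal output instead.

```shell
# Sub-commands copied from the bin/solr -help output.
subcommands="start stop restart status healthcheck create create_core create_collection delete version zk auth"

# Echo the help invocation for each sub-command and keep a running count.
count=0
for cmd in $subcommands; do
  echo "bin/solr $cmd -help"
  count=$((count + 1))
done

echo "$count sub-commands"
```

Copy any printed line into the terminal, from the solr-7.0.0 directory, to read the help for that sub-command.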

    $ bin/solr start -help

    Usage: solr start [-f] [-c] [-h hostname] [-p port] [-d directory] [-z zkHost] [-m memory] [-e example] [-s solr.solr.home] [-t solr.data.home] [-a "additional-options"] [-V]

      -f            Start Solr in foreground; default starts Solr in the background
                      and sends stdout / stderr to solr-PORT-console.log

      -c or -cloud  Start Solr in SolrCloud mode; if -z not supplied, an embedded Zookeeper
                      instance is started on Solr port+1000, such as 9983 if Solr is bound to 8983

      -h <host>     Specify the hostname for this Solr instance

      -p <port>     Specify the port to start the Solr HTTP listener on; default is 8983
                      The specified port (SOLR_PORT) will also be used to determine the stop port
                      STOP_PORT=($SOLR_PORT-1000) and JMX RMI listen port RMI_PORT=($SOLR_PORT+10000).
                      For instance, if you set -p 8985, then the STOP_PORT=7985 and RMI_PORT=18985

      -d <dir>      Specify the Solr server directory; defaults to server

      -z <zkHost>   Zookeeper connection string; only used when running in SolrCloud mode using -c
                      To launch an embedded Zookeeper instance, don't pass this parameter.

      -m <memory>   Sets the min (-Xms) and max (-Xmx) heap size for the JVM, such as: -m 4g
                      results in: -Xms4g -Xmx4g; by default, this script sets the heap size to 512m

      -s <dir>      Sets the solr.solr.home system property; Solr will create core directories under
                      this directory. This allows you to run multiple Solr instances on the same host
                      while reusing the same server directory set using the -d parameter. If set, the
                      specified directory should contain a solr.xml file, unless solr.xml exists in Zookeeper.
                      This parameter is ignored when running examples (-e), as the solr.solr.home depends
                      on which example is run. The default value is server/solr.

      -t <dir>      Sets the solr.data.home system property, where Solr will store data (index).
                      If not set, Solr uses solr.solr.home for config and data.

      -e <example>  Name of the example to run; available examples:
          cloud:         SolrCloud example
          techproducts:  Comprehensive example illustrating many of Solr's core capabilities
          dih:           Data Import Handler
          schemaless:    Schema-less example

      -a            Additional parameters to pass to the JVM when starting Solr, such as to setup
                      Java debug options. For example, to enable a Java debugger to attach to the Solr JVM
                      you could pass: -a "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=18983"
                      In most cases, you should wrap the additional parameters in double quotes.

      -j            Additional parameters to pass to Jetty when starting Solr.
                      For example, to add configuration folder that jetty should read
                      you could pass: -j "--include-jetty-dir=/etc/jetty/custom/server/"
                      In most cases, you should wrap the additional parameters in double quotes.

      -noprompt     Don't prompt for input; accept all defaults when running examples that accept user input

      -v and -q     Verbose (-v) or quiet (-q) logging. Sets default log level to DEBUG or WARN instead of INFO

      -V or -verbose Verbose messages from this script
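
The help text packs in a fair amount of port and memory arithmetic, so here is a minimal sketch that reproduces it. The function names are my own and are not part of the solr script. The formulas are taken from the -p, -c and -m descriptions above.

```shell
# Hypothetical helpers that mirror the formulas quoted in the start -help text.
stop_port()        { echo $(( $1 - 1000 )); }    # STOP_PORT = SOLR_PORT - 1000
rmi_port()         { echo $(( $1 + 10000 )); }   # RMI_PORT  = SOLR_PORT + 10000
embedded_zk_port() { echo $(( $1 + 1000 )); }    # embedded Zookeeper = Solr port + 1000
heap_flags()       { echo "-Xms$1 -Xmx$1"; }     # -m sets both min and max heap

# The examples used in the help text itself.
echo "With -p 8985: STOP_PORT=$(stop_port 8985) RMI_PORT=$(rmi_port 8985)"
echo "With -c on port 8983: embedded Zookeeper on $(embedded_zk_port 8983)"
echo "With -m 4g: $(heap_flags 4g)"
```

This is useful to keep in mind when choosing a port with -p, since the stop port and the RMI port need to be free as well.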

Start the Solr Server using bin/solr

Okay, now that you know how to find help, the next step is to start the Solr server using bin/solr start.

    $ bin/solr start
    Waiting up to 180 seconds to see Solr running on port 8983 [\]
    Started Solr server on port 8983 (pid=11828). Happy searching!
    $ _

If you see this message, the Solr server started on port 8983 and is waiting for your next request.

Step 5 - View the Solr Admin Console in a Browser

Now that the Solr server is up and running let's look at it using the Solr Admin User Interface, a web-based tool that can be used to perform system administration and to enter search queries.

On a localhost

If you installed Solr on a client machine this can be found by entering http://localhost:8983/solr/ in your browser address bar.

On a server

If you installed Solr on a headless server, like I did, enter the server's hostname or, as in my case, its IP address, followed by a colon, the default port number 8983 and then the /solr/ path.
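
Both cases follow the same pattern, which the sketch below spells out. The admin_url function is a hypothetical helper of mine, and 192.0.2.10 is only a placeholder address reserved for documentation, so substitute your own server's hostname or IP address.

```shell
# Hypothetical helper: build the Solr Admin URL from a host and a port.
# Defaults to localhost and the default Solr port 8983.
admin_url() {
  host=${1:-localhost}
  port=${2:-8983}
  echo "http://$host:$port/solr/"
}

admin_url                  # Solr on the local client machine
admin_url 192.0.2.10       # Solr on a remote server (placeholder address)
admin_url 192.0.2.10 8984  # the same server started with -p 8984
```

Paste the printed URL into your browser address bar while the Solr server is running.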

When we create a dataset this will be more interesting and there is a lot here that is beyond our current knowledge level, but what I would like to mention are two terms people often use synonymously: core and collection.

A core is used in the context of a standalone mode, or non-distributed search using one server, which we have here, whereas a collection is used in SolrCloud mode, or distributed search where the index is spread across several computers to handle heavy search traffic.

At this point since we will begin in standalone mode, we will focus on the term core and refer to it as a collection of documents with its own name, directory, data, configurations and schema that all come together to create a searchable index.

Under Core Admin from within the Solr Admin UI you can create one. Doing so will populate default names for each of five fields. But since we haven't explored these yet, it is a good time to stop until we do.

Stop the Solr server from the command line

Before we close out, let's head back to the command line and stop the Solr server instance using bin/solr stop.

    $ bin/solr stop
    Sending stop command to Solr running on port 8983 ... Waiting up to 180 seconds to allow Jetty process 11828 to stop gracefully.
    $ _

Very good. So there is a review of Solr basics. In the next tutorial we will cover the term inverted index leading to a discussion of configuration files including schema.

Of course working with Apache Solr can be challenging, so if you need any customized help for your particular situation please feel free to reach out on social media, including at our FactorPad YouTube Channel.

Related Solr Reference Material

Questions and Answers

Q: Which type of data is easier to start with in Solr, structured or unstructured?
A: I think unstructured data is easier to learn because it behaves more like the search engines we are all familiar with (like Google, Bing or Yandex). Structured data, like that stored in a database, is less intuitive and more difficult for beginners. For this reason, we will start by indexing unstructured documents.

A Review of Apache Solr Basics for Beginners