When building public-facing website search or private enterprise search applications, developers often face vast numbers of documents in a variety of formats. Apache Solr and its connection to Apache Tika through the Solr Cell framework offer a way to index documents regardless of file type. Both binary files and plain-text files can be read, interpreted and indexed, with the eventual goal of creating a useful index and search application for users.
Of course, customizations to the Solr configuration files are required. But before that, we want to see how these files will be parsed, pulling out both metadata and plain text from the files. The metadata may be useful for structured data in the Solr index and plain text may be useful for searches of unstructured document content. Either way, we need a starting point, and the Solr Cell framework offers a quick way to evaluate documents without having to learn the ins-and-outs of Apache Tika, and the language both it and Solr were written in, Java.
The Apache Tika parser is capable of reading over 1,000 different file
types and returning metadata information and plain text from the files.
So if you post a document using the
curl command, Tika will return the
metadata fields and the plain text from a file by stripping out the
markup and other information you don't want in a searchable index.
Metadata is data about data: files typically include data about the file itself. So a video file may have a metadata field that indicates the runtime of the video. An image file may have its dimensions and an HTML file may have keywords and a description. While there is no perfect standard, Apache Tika will attempt to categorize about 15 different fields. Some of the more common ones are listed below.
With respect to stripping plain text from files, Apache Tika uses a variety of parser libraries to access text and binary files. Remember, files in PDF, XLSX, and DOC formats may be saved in binary format, so parsers need to be able to pull plain text and metadata accurately. Even plain-text formats like HTML are difficult because while many web developers write HTML documents that render well in a browser, they may not be written well technically. So Apache Tika uses a library called TagSoup to parse HTML documents.
In general, HTML is less strict than XML. To accommodate both, Apache Tika translates text from both formats to a stricter format called XHTML so it can be interpreted by XML parsers.
In Apache Solr, whether it be structured or unstructured data, we
normally want to search both metadata fields and plain text. Before we
post documents to the index, we can see what information Apache Tika
learns about the document by using
curl. This will help us troubleshoot
and fine-tune our import settings in both the Solr
The general way to access the Apache Tika library is with the
curl command, though we will also see the
bin/post tool. In order to
use Solr Cell, it requires that several lines exist in the
For the default setup when you create a new core or collection, at
least with Solr version 7, it will automatically copy over what is
called a _default configset, or set of configuration
files. The
solrconfig.xml file contains
two relevant sections. First is the section that pulls in the Solr Cell plugins.
Second is the section identified as Solr Cell Update Request Handler. This is the part that must be customized for specific fields imported from the metadata. The default settings are listed below.
This Update Request Handler can be accessed over the HTTP protocol
with curl using this general format.
Where localhost is the hostname or IP address of the
Solr server and <core> is the name of the core or
collection. The /update/extract section points to the
Solr Cell ExtractingRequestHandler class set up in the
solrconfig.xml as mentioned above.
After the ? you include a variety of
<parameters> detailed in section 3 below.
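The pieces of the URL described above can be assembled as follows. This is a minimal sketch, assuming Solr listens on its default port 8983; the core name example and the single parameter are placeholders chosen for illustration.

```shell
#!/bin/sh
# Assumed values for illustration: Solr's default port 8983 and a core named "example".
SOLR_HOST="localhost:8983"
CORE="example"
PARAMS="extractOnly=true"

# Host, core, the /update/extract handler path, then the parameters after "?".
EXTRACT_URL="http://${SOLR_HOST}/solr/${CORE}/update/extract?${PARAMS}"

# Print the request instead of sending it, so the URL can be checked first.
echo "curl '${EXTRACT_URL}'"
```

Quoting the URL matters once several parameters are joined with &, because an unquoted & would be interpreted by the shell.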
As for background on the curl
command, it is used to communicate with servers using a variety of
protocols, including FTP, HTTP, IMAP, POP3 and SMTP. The general
format looks like this.
When we submit requests to the Solr Cell Extracting Request
Handler, we will use the HTTP protocol and the following options, or
flags, associated with curl.
--data-binary - This flag posts
data exactly as sent without additional processing.
-F - This flag submits the request as a multipart form POST,
which allows you to upload a file.
-H - This flag is used to send
extra headers when using HTTP to communicate with a server.
Below is an example of how you might use the
curl command to see what data will
be extracted from a file without posting it to the index.
Here <core> refers to the name of the core or collection. The @ provided just before the location of the file attaches the file to the POST as if it were made from a form and the submit button was pressed. Here we assume that the file example.html is in the same directory but could also be an absolute or relative reference to another file. The extractOnly=true parameter only returns output from the Tika parser without posting the document to the index. The -H option allows you to additionally supply a content type for the file.
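The two upload styles discussed above might look like the following. This is a sketch rather than the exact command from this page: it assumes the default port 8983, a core named example, and a file example.html in the current directory. The function names form_upload and raw_upload are made up for illustration.

```shell
#!/bin/sh
# Assumptions: default port 8983, a core named "example", example.html in this directory.
URL="http://localhost:8983/solr/example/update/extract?extractOnly=true"
FILE="example.html"

# Style 1: multipart form upload. The @ tells curl to attach the file's contents.
form_upload() {
  curl "$URL" -F "myfile=@${FILE}"
}

# Style 2: send the file as the raw request body and state its content type with -H.
raw_upload() {
  curl "$URL" --data-binary "@${FILE}" -H "Content-Type: text/html"
}

# Nothing is sent yet; call form_upload or raw_upload against a running Solr server.
echo "Defined form_upload and raw_upload for ${URL}"
```

Because extractOnly=true is set, either function should only return what Tika extracts and leave the index untouched.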
Solr provides a shell script to post documents to an index with the following general format.
This syntax assumes your current working directory is the installation
directory for Solr, which for version 7 would be
~/solr-7.0.0/ in standalone mode for
a local installation. When running in a production environment the
directory locations may differ.
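A small wrapper can make the working-directory assumption explicit. This is a sketch under stated assumptions: the installation path matches the version 7 example above, the core is named example, and post_files is a hypothetical helper name.

```shell
#!/bin/sh
# Assumed installation directory and core name; adjust both for your setup.
SOLR_DIR="${HOME:-.}/solr-7.0.0"
CORE="example"

# Run bin/post from the installation directory so the relative path resolves.
# The subshell keeps the caller's working directory unchanged.
post_files() {
  ( cd "$SOLR_DIR" && bin/post -c "$CORE" "$@" )
}

echo "Usage: post_files <files or directories>"
```

Note that, unlike an extractOnly request, calling post_files on a file sends it to the index. File paths passed to it are resolved relative to the installation directory.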
If Windows is your preferred environment for custom search with Solr, the
post tool is run by pointing to
example\exampledocs\post.jar from the
installation directory. To find help use
java -jar example\exampledocs\post.jar
-help. Please see the documentation for Windows, as the rest of
this page will refer to usage in Linux-type environments.
bin/post commands can take the 19
options (parameters) relating to the ExtractingRequestHandler that are listed below. Required
arguments include the core or collection name and the location of the
files to post.
Parameter | Description
capture | While document text will be captured in a field called "content", this parameter allows you to capture a specific section of the document, like the text within <p> or <div> tags in HTML.
captureAttr | This can capture attributes within HTML tags. For example, a link using an <a href="filename.html"> tag could pull out the attribute href="filename.html".
commitWithin | Commits the added document to the index within a specified number of milliseconds.
date.formats | Sets the date formats used to recognize dates in documents.
defaultField | If the uprefix parameter is not set and a field cannot be determined, the field named here is used, for example defaultField=text.
extractOnly | Returns the content extracted by Tika without sending the document to the index.
extractFormat | If the extractOnly option is selected, you can also use extractFormat to specify whether the response should be returned in xml or text format.
fmap.source_field | Maps a field name produced by Tika to a differently named field in Solr. So fmap.title=about would map the field Tika names title to one named about in Solr.
ignoreTikaException | If ignoreTikaException=true is provided then exceptions will be skipped and any available metadata will be indexed.
literal.fieldname | This fills in a field with a literal value. For example, with literal.about=empty the about field for each posted document will read "empty".
literalsOverride | If literalsOverride=true then values set with literal.fieldname override the values Tika extracts for the same field. If literalsOverride=false then the literal values are appended as another entry alongside the values from Tika, so the field must be set to multiValued=true.
lowernames | This will change all field names to lowercase with underscores if lowernames=true. So a field named Content-Encoding would be changed to content_encoding.
multipartUploadLimitInKB | This will limit the size of documents allowed to be uploaded, in KB.
passwordsFile | This allows you to point to a file that maps file names to passwords.
resource.name | To specify the name of the file, which Tika can use as a hint when detecting the file's MIME type.
resource.password | Used to submit a password if the file is password protected.
tika.config | This allows you to point to a customized Tika configuration file for more advanced implementations.
uprefix | For fields that are not already in the Solr schema, this parameter will prepend a prefix to the field name from Tika. This can be used with a dynamic field definition on the Solr end to create fields which Solr will then ignore. For example, using uprefix=ignored_ with the field content_type results in a field named ignored_content_type. Then on the Solr end each field that starts with ignored_ would not be imported. This assumes that there is a dynamicField in the schema.xml file with a line specified like this: <dynamicField name="ignored_*" type="ignored" />
xpath | Used to restrict the Tika document extraction to the sections of the XHTML document that match an XPath expression.
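Several of these parameters are usually combined in one request by joining them with &. The following is a hedged sketch, not a recommended configuration: it assumes the default port 8983, a core named example, and a schema with an id field (hence literal.id=doc1). The helper name index_file is hypothetical.

```shell
#!/bin/sh
# Assumed endpoint: default port 8983 and a core named "example".
BASE="http://localhost:8983/solr/example/update/extract"

# Build the query string one parameter at a time, joined with "&".
PARAMS="literal.id=doc1"              # assumes the schema has an "id" field
PARAMS="${PARAMS}&lowernames=true"    # Content-Encoding becomes content_encoding
PARAMS="${PARAMS}&fmap.title=about"   # Tika's title field is stored as "about"
PARAMS="${PARAMS}&uprefix=ignored_"   # unknown fields get the ignored_ prefix

INDEX_URL="${BASE}?${PARAMS}"

# Hypothetical helper: posts the file given as the first argument to the index.
index_file() {
  curl "$INDEX_URL" -F "myfile=@$1"
}

echo "$INDEX_URL"
```

The script only prints the URL. Calling index_file docs/test.html would send the document to Solr, so it is worth previewing with extractOnly=true before doing so.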
For these examples, assume we have an example HTML
file saved as
docs/test.html from the
Solr installation directory.
The following command parses a file called
test.html and summarizes it at the
terminal using a core named example without posting
it to the index.
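One way such a command could be written is sketched below. It assumes Solr's default port 8983 and uses extractFormat=text to ask for plain text instead of XHTML. The names preview_url and extract_preview are hypothetical helpers, not part of Solr.

```shell
#!/bin/sh
# Build the extract-only URL for a given core (assumes the default port 8983).
preview_url() {
  printf 'http://localhost:8983/solr/%s/update/extract?extractOnly=true&extractFormat=text\n' "$1"
}

# Upload a file for parsing only: first argument is the core, second is the file.
extract_preview() {
  curl "$(preview_url "$1")" -F "myfile=@$2"
}

# Show the URL that would be used for the core named "example".
preview_url example
```

With a running server, extract_preview example docs/test.html would print the Tika output at the terminal without changing the index.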
The following command passes parameters to
bin/post and prints the extraction
output to the terminal for the core named example.
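A sketch of such a command follows, using the -params and -out options of bin/post. It assumes the current directory is the Solr installation directory. The wt=json and indent=true parameters are added here only to make the response easier to read, and run_post_preview is a hypothetical wrapper name.

```shell
#!/bin/sh
# Values taken from the examples on this page.
CORE="example"
FILE="docs/test.html"

# extractOnly keeps the document out of the index; wt and indent shape the response.
PARAMS="extractOnly=true&wt=json&indent=true"

# -params passes the query string through to the handler; -out yes echoes the response.
run_post_preview() {
  bin/post -c "$CORE" -params "$PARAMS" -out yes "$FILE"
}

# Print the command for review; call run_post_preview to execute it.
echo "bin/post -c ${CORE} -params \"${PARAMS}\" -out yes ${FILE}"
```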