Solr Cell - Apache Tika Syntax and Examples | Lucene and Solr Reference

The Solr Cell and Apache Tika Command Reference

Beginner

When building public-facing website search or private enterprise search applications, developers often face vast numbers of documents in a variety of formats. Apache Solr, through its connection to Apache Tika via the Solr Cell framework, offers a way to index documents regardless of file type. Both binary and plain-text files can be read, interpreted and indexed, with the eventual goal of creating a useful index and search application for users.

Of course, customizations to the Solr configuration files are required. But before that, we want to see how these files will be parsed, pulling out both metadata and plain text. The metadata may be useful as structured data in the Solr index, and the plain text may be useful for searches of unstructured document content. Either way, we need a starting point, and the Solr Cell framework offers a quick way to evaluate documents without having to learn the ins and outs of Apache Tika or of Java, the language in which both Tika and Solr are written.


1. About Solr Cell and the Apache Tika Parser

The Apache Tika parser is capable of reading over 1,000 different file types and returning metadata information and plain text from the files. So if you post a document using the curl command, Tika will return the metadata fields and the plain text from a file by stripping out the markup and other information you don't want in a searchable index.

About metadata and Tika

Metadata is data about data: files typically include information about the file itself. A video file may have a metadata field that indicates the runtime of the video, an image file may record its dimensions, and an HTML file may have keywords and a description. While there is no perfect standard, Apache Tika will attempt to categorize about 15 different fields. Some of the more common ones are listed below.

Creator - including website, publisher or author
Date - including update frequency or date modified
Description - a description of the content
Format - the media type or dimensions
Language - the associated spoken language
Title - the name associated with the file

About plain text and Tika

To extract plain text from files, Apache Tika uses a variety of parser libraries that can read both text and binary files. Remember, files in PDF, XLSX, and DOC formats may be saved in binary format, so parsers need to be able to pull plain text and metadata accurately. Even plain-text formats like HTML are difficult: many web developers write HTML documents that render well in a browser but are not technically well formed. So Apache Tika uses a library called TagSoup to parse HTML documents.

In general, HTML is less strict than XML. To accommodate both, Apache Tika translates text from both formats to a strict format called XHTML so it can be interpreted by XML parsers.

From Tika to Solr

In Apache Solr, whether it be structured or unstructured data, we normally want to search both metadata fields and plain text. Before we post documents to the index, we can see what information Apache Tika learns about the document by using curl. This will help us troubleshoot and fine-tune our import settings in both the Solr schema.xml and solrconfig.xml.

2. Syntax for the solrconfig.xml, curl and bin/post

The Apache Tika library is generally accessed with the curl command, although we will also look at the Solr bin/post tool. Solr Cell requires that several lines exist in the solrconfig.xml.

The solrconfig.xml configuration file requirements

When you create a new core or collection with the default setup, at least with Solr version 7, Solr automatically copies over what is called the _default configset, or set of configuration files. The solrconfig.xml file contains two relevant sections. First is the section that pulls in the Solr Cell plugins.

Second is the section identified as Solr Cell Update Request Handler. This is the part that must be customized for specific fields imported from the metadata. The default settings are listed below.

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">_text_</str>
  </lst>
</requestHandler>
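Before sending requests, it can be worth confirming that a core's solrconfig.xml actually declares this handler. The following is a minimal sketch rather than an official tool: has_extract_handler is a hypothetical helper name, and the path in the usage comment assumes a standalone installation with a core named example.

```shell
# Hypothetical helper: succeed (exit status 0) if the given solrconfig.xml
# declares a request handler named /update/extract, fail otherwise.
has_extract_handler() {
  grep -q 'name="/update/extract"' "$1"
}

# Usage (assumed path for a standalone core named "example"):
#   has_extract_handler server/solr/example/conf/solrconfig.xml \
#     && echo "Solr Cell handler found"
```

The check only looks for the handler name, so it will not detect a handler that is present but commented out.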

This Update Request Handler can be accessed using the HTTP protocol with curl using this general format.

$ curl "http://localhost:8983/solr/<core>/update/extract?<parameters>"

Where localhost is the hostname or IP address of the Solr server and <core> is the name of the core or collection. The /update/extract section points to the Solr Cell ExtractingRequestHandler class set up in the solrconfig.xml as mentioned above. After the ? you include a variety of <parameters> detailed in section 3 below.
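The URL format described above can be assembled with a small shell function. This is only a sketch: build_extract_url and the SOLR_BASE variable are hypothetical names introduced for illustration, and the default host and port are taken from the examples on this page.

```shell
# Assumed base URL; override SOLR_BASE for a different host or port.
SOLR_BASE="${SOLR_BASE:-http://localhost:8983/solr}"

# Hypothetical helper: print the Solr Cell URL for a core, followed by any
# key=value parameters joined with "?" and "&".
build_extract_url() {
  local core="$1"
  shift
  local url="${SOLR_BASE}/${core}/update/extract"
  local sep="?"
  local param
  for param in "$@"; do
    url="${url}${sep}${param}"
    sep="&"
  done
  printf '%s\n' "$url"
}

# Usage:
#   build_extract_url example extractOnly=true extractFormat=text
```

The function does not URL-encode parameter values, so values containing spaces or reserved characters would need to be encoded before being passed in.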

The curl command

As background, the curl command is used to communicate with servers over a variety of protocols, including FTP, HTTP, IMAP, POP3 and SMTP. The general format looks like this.

$ curl [options] [URL...]

When we submit requests to the Solr Cell Extracting Request Handler, we will use the HTTP protocol and the following options, or flags, associated with curl.

--data-binary - This flag posts data exactly as sent without additional processing.
-F - This flag emulates a submitted form by sending the request as a multipart POST, which allows you to upload a file.
-H - This flag is used to send extra headers when using HTTP to communicate with a server.

Below is an example of how you might use the curl command to see what data will be extracted from a file without posting it to the index.

$ curl "http://localhost:8983/solr/<core>/update/extract?extractOnly=true" --data-binary @example.html -H "Content-type:text/html"

Here <core> refers to the name of the core or collection. The @ just before the file location tells curl to read the request body from that file and send it in the POST. Here we assume that the file example.html is in the current directory, but the reference could also be an absolute or relative path to another file. The extractOnly=true parameter returns the output from the Tika parser without posting the document to the index. The -H option additionally supplies a content type for the file.
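Because the -H header should match the file being sent, the command above can be wrapped so the content type is chosen from the file extension. This is a sketch under stated assumptions: content_type_for and preview_extraction are hypothetical helper names, the extension list is deliberately short, and the host and port are the ones used throughout this page.

```shell
# Hypothetical helper: map a file name to a MIME type for the -H header.
content_type_for() {
  case "$1" in
    *.html|*.htm) echo "text/html" ;;
    *.pdf)        echo "application/pdf" ;;
    *.txt)        echo "text/plain" ;;
    *)            echo "application/octet-stream" ;;
  esac
}

# Hypothetical helper: show what Tika extracts from a file without indexing it.
preview_extraction() {
  local core="$1"
  local file="$2"
  curl "http://localhost:8983/solr/${core}/update/extract?extractOnly=true" \
    --data-binary "@${file}" \
    -H "Content-type:$(content_type_for "$file")"
}

# Usage:
#   preview_extraction example example.html
```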

The Solr post tool with the bin/post script

Solr provides a shell script to post documents to an index with the following general format.

$ bin/post -c <name> [OPTIONS] <file|directories|urls|-d ["...",...]>

This syntax assumes your current working directory is the installation directory for Solr, which for version 7 would be ~/solr-7.0.0/ in standalone mode for a local installation. When running in a production environment the directory locations may differ.

If you run Solr on Windows, the post tool is run by pointing to example\exampledocs\post.jar from the installation directory. To find help use java -jar example\exampledocs\post.jar -help. Please see the documentation for Windows, as the rest of this page refers to usage in Linux-type environments.

3. Options for the Solr Cell ExtractingRequestHandler

The curl and bin/post commands can take the options (parameters) listed below, which relate to the ExtractingRequestHandler. Required fields include the core or collection name and the location of the files to post.

Option (parameter)	Description	Default
`capture`	While document text will be captured in a field called "content", this parameter allows you to capture a specific section of the document, like those within <p> or <div> tags within HTML.	none
`captureAttr`	This can capture attributes within HTML tags. For example, a link using an <a href="filename.html"> tag could pull out the attribute href="filename.html".	none
`commitWithin`	Commits the added document to the index within the specified number of milliseconds.	none
`date.formats`	Defines the date format patterns to recognize in documents.	none
`defaultField`	If the uprefix parameter is not set and a field cannot be determined, the field named by this parameter (for example, defaultField=text) will be used.	none
`extractOnly`	Returns the content extracted by Tika without sending the document to the index.	false
`extractFormat`	If the extractOnly option is selected you can also use this extractFormat to specify whether it should return the response in xml or text format.	xml
`fmap.source_field`	Maps a field named in the incoming document to a field with a different name in Solr. So fmap.title=about would map a field named title by Tika to one named about in Solr.	none
`ignoreTikaException`	If the ignoreTikaException=true is provided then exceptions will be skipped and metadata will be indexed.	false
`literal.fieldname`	This fills in a field with a literal value. For example, with literal.about=empty the about field for each document will be updated to read "empty".	none
`literalsOverride`	If literalsOverride=true, values given with literal.fieldname override values extracted by Tika for the same field. If literalsOverride=false, the literal values are appended to the values from Tika, so the field must be set to multiValued=true.	true
`lowernames`	This will change all field names to lowercase with underscores if lowernames=true. So a field named Content-Encoding would be changed to content_encoding.	false
`multipartUploadLimitInKB`	This will limit the size of documents allowed to be uploaded in KB.	none
`passwordsFile`	This allows you to point to a file with passwords.	none
`resource.name`	To specify an alternative name which Tika will use to detect the file's MIME type.	none
`resource.password`	Used to submit a password if the file has its own password.	none
`tika.config`	This allows you to point to a customized Tika configuration file for more advanced implementations.	none
`uprefix`	Prepends the given text to the name of any field that is not already defined in the Solr schema. Combined with a dynamic field definition on the Solr end, this creates fields which Solr will then ignore. For example, with uprefix=ignored_ the field content_type becomes ignored_content_type, and each field prefixed with ignored_ is then not imported. This assumes that the schema.xml file contains a dynamicField definition like this: <dynamicField name="ignored_*" type="ignored" />	none
`xpath`	Restricts the Tika extraction to the sections of the XHTML document that match the given XPath expression.	none
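Several of the options above are typically combined when a document is actually indexed rather than previewed. The following sketch shows one possible combination; index_query and index_document are hypothetical helper names. It assumes the schema has an id field, a _text_ field and an ignored_* dynamic field, and that commit=true (a general Solr update parameter, not one of the Solr Cell options in the table) is acceptable for your setup.

```shell
# Hypothetical helper: print a query string that sets the id with
# literal.fieldname, prefixes unknown fields, and maps content to _text_.
index_query() {
  local doc_id="$1"
  printf 'literal.id=%s&uprefix=ignored_&fmap.content=_text_&commit=true\n' "$doc_id"
}

# Hypothetical helper: upload a file as a form field with -F and index it.
index_document() {
  local core="$1"
  local doc_id="$2"
  local file="$3"
  curl "http://localhost:8983/solr/${core}/update/extract?$(index_query "$doc_id")" \
    -F "myfile=@${file}"
}

# Usage:
#   index_document example doc1 docs/test.html
```

Running the same file first with extractOnly=true shows which field names Tika produces, which helps decide whether uprefix or more specific fmap mappings are needed.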

4. Examples of Solr Cell to the Apache Tika Parser

For these examples, assume we have an example HTML file saved as docs/test.html from the Solr installation directory.

Example 1 - View a parsed HTML file using curl

The following command parses a file called test.html and summarizes it at the terminal using a core named example without posting it to the index.

$ curl "http://localhost:8983/solr/example/update/extract?extractOnly=true" --data-binary @docs/test.html -H "Content-type:text/html"

Example 2 - View a parsed HTML file using bin/post

The following command allows you to pass parameters to bin/post viewing the extraction with output to the terminal for the core named example.

$ bin/post -c example -params "extractOnly=true" -out yes docs/test.html
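To preview more than one file, the command in Example 2 can be repeated for every HTML file under a directory. This is a sketch, not part of the Solr distribution: list_html_files and preview_all are hypothetical helper names, and the current directory is assumed to be the Solr installation directory so that bin/post resolves.

```shell
# Hypothetical helper: list the HTML files under a directory, one per line,
# in sorted order.
list_html_files() {
  find "$1" -type f -name '*.html' | sort
}

# Hypothetical helper: run the Example 2 command for each file found.
preview_all() {
  local dir="$1"
  local core="$2"
  list_html_files "$dir" | while IFS= read -r file; do
    bin/post -c "$core" -params "extractOnly=true" -out yes "$file"
  done
}

# Usage:
#   preview_all docs example
```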
