FactorPad
Build a Better Process

Solr Cell and the Apache Tika Parser : Examples and Syntax

The Apache Tika library parses file contents for thousands of file formats.
  1. About - Understand the purpose of Solr Cell and Tika.
  2. Syntax - See Solr Cell and Tika command usage in the solrconfig.xml, curl and bin/post.
  3. Options - View options for Tika for the Extracting Request Handler.
  4. Examples - Review several common examples.
by Paul Alan Davis, CFA
Updated: February 25, 2021
Our focus is on the Linux command line so these commands also work for macOS. There may be slight differences for Windows.

The Solr Cell and Apache Tika Command Reference

Beginner

When building public-facing website search or private enterprise search applications, developers often face large numbers of documents in a variety of formats. Apache Solr, through its connection to Apache Tika in the Solr Cell framework, offers a way to index documents regardless of file type. Both binary and plain-text files can be read, interpreted and indexed, with the eventual goal of creating a useful index and search application for users.

Of course, customizations to the Solr configuration files are required. But before that, we want to see how these files will be parsed, pulling out both metadata and plain text from the files. The metadata may be useful for structured data in the Solr index and plain text may be useful for searches of unstructured document content. Either way, we need a starting point, and the Solr Cell framework offers a quick way to evaluate documents without having to learn the ins-and-outs of Apache Tika, and the language both it and Solr were written in, Java.

Apache Solr Reference

1. About Solr Cell and the Apache Tika Parser

The Apache Tika parser is capable of reading over 1,000 different file types and returning metadata information and plain text from the files. So if you post a document using the curl command, Tika will return the metadata fields and the plain text from a file by stripping out the markup and other information you don't want in a searchable index.

About metadata and Tika

Metadata is data about data, and files typically include data about the file itself. A video file may have a metadata field that indicates the runtime of the video. An image file may have its dimensions, and an HTML file may have keywords and a description. While there is no perfect standard, Apache Tika will attempt to categorize about 15 different fields.

About plain text and Tika

With respect to extracting plain text from files, Apache Tika uses a variety of parser libraries to access text and binary files. Remember, files in PDF, XLSX, and DOC formats may be saved in binary format, so parsers need to be able to pull plain text and metadata accurately. Even plain-text formats like HTML are difficult, because many HTML documents that render well in a browser are not well-formed technically. So Apache Tika uses a library called TagSoup to parse HTML documents.

In general, HTML is less strict than XML, and to accommodate both, Apache Tika translates text from both formats to a strict format called XHTML so it can be interpreted by XML parsers.

From Tika to Solr

In Apache Solr, whether it be structured or unstructured data, we normally want to search both metadata fields and plain text. Before we post documents to the index, we can see what information Apache Tika learns about the document by using curl. This will help us troubleshoot and fine-tune our import settings in both the Solr schema.xml and solrconfig.xml.

2. Syntax for the solrconfig.xml, curl and bin/post

The usual way to access the Apache Tika library is with the curl command; however, we will also see the Solr bin/post tool. Solr Cell requires that several lines exist in the solrconfig.xml file.

The solrconfig.xml configuration file requirements

When you create a new core or collection with the default setup, at least with Solr version 7, Solr automatically copies over what is called the _default configset, or set of configuration files. Two sections of the solrconfig.xml file matter here. First is the section that pulls in the Solr Cell plugins.

<lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />

Second is the section identified as Solr Cell Update Request Handler. This is the part that must be customized for specific fields imported from the metadata. The default settings are listed below.

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">_text_</str>
  </lst>
</requestHandler>

This Update Request Handler can be accessed using the HTTP protocol with curl using this general format.

$ curl "http://localhost:8983/solr/<core>/update/extract?<parameters>"

Here localhost is the hostname or IP address of the Solr server and <core> is the name of the core or collection. The /update/extract section points to the Solr Cell ExtractingRequestHandler class set up in the solrconfig.xml as mentioned above. After the ? you include a variety of <parameters>, detailed in section 3 below and separated from one another by &.
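
As a rough illustration of that URL format, the sketch below assembles the address from its parts. The function name extract_url and the SOLR_HOST variable are made up for this example and are not part of Solr or curl. It also assumes the parameter values contain no characters that need URL encoding.

```shell
#!/bin/sh
# Hypothetical helper: assemble the Solr Cell endpoint URL from its parts.
# Usage: extract_url <core> [param=value ...]
extract_url() {
    core=$1
    shift
    # SOLR_HOST is an assumed variable name; it defaults to the local server.
    url="http://${SOLR_HOST:-localhost:8983}/solr/${core}/update/extract"
    sep='?'
    for param in "$@"; do
        # The first parameter follows "?", every later one follows "&".
        url="${url}${sep}${param}"
        sep='&'
    done
    printf '%s\n' "$url"
}

extract_url example extractOnly=true extractFormat=text
```

The printed address can be passed to curl inside double quotes, which keeps the shell from interpreting the & characters.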

The curl command

As for a background on the curl command, it is used to communicate with servers using a variety of protocols, including FTP, HTTP, IMAP, POP3 and SMTP. The general format looks like this.

$ curl [options] [URL...]

When we submit requests to the Solr Cell Extracting Request Handler, we use the HTTP protocol along with a few curl options, or flags, such as --data-binary and -H.

Below is an example of how you might use the curl command to see what data will be extracted from a file without posting it to the index.

$ curl "http://localhost:8983/solr/<core>/update/extract?extractOnly=true" --data-binary @example.html -H "Content-type:text/html"

Here <core> refers to the name of the core or collection. The --data-binary option sends data in the body of a POST request exactly as it is, and the @ just before the file location tells curl to read that data from the file. Here we assume that the file example.html is in the current directory, but the location could also be an absolute or relative reference to another file. The extractOnly=true parameter returns output from the Tika parser without posting the document to the index. The -H option adds a header to the request, used here to supply a content type for the file.
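
To avoid retyping the content type for each file, the request can be wrapped in a small shell function. This is only a sketch: content_type_for and tika_preview are hypothetical names, the extension list is deliberately short, and the server address localhost:8983 is assumed.

```shell
#!/bin/sh
# Hypothetical helper: choose a Content-type value from the file extension.
content_type_for() {
    case "$1" in
        *.html|*.htm) echo "text/html" ;;
        *.pdf)        echo "application/pdf" ;;
        *.txt)        echo "text/plain" ;;
        *)            echo "unknown file type: $1" >&2; return 1 ;;
    esac
}

# Hypothetical wrapper: preview what Tika extracts without indexing the file.
# Usage: tika_preview <core> <file>
tika_preview() {
    # Stop early if the extension is not in the list above.
    ctype=$(content_type_for "$2") || return 1
    curl "http://localhost:8983/solr/$1/update/extract?extractOnly=true" \
        --data-binary "@$2" \
        -H "Content-type:${ctype}"
}
```

With these functions loaded in the shell and a Solr server running, tika_preview <core> example.html would send the same request as the curl command above.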

The Solr post tool with the bin/post script

Solr provides a shell script to post documents to an index with the following general format.

$ bin/post -c <name> [OPTIONS] <file|directories|urls|-d ["...",...]>

This syntax assumes your current working directory is the installation directory for Solr, which for version 7 would be ~/solr-7.0.0/ in standalone mode for a local installation. When running in a production environment the directory locations may differ.

If Windows is your preferred environment for custom search with Solr, the post tool is run by pointing to example\exampledocs\post.jar from the installation directory. To find help use java -jar example\exampledocs\post.jar -help. Please see the documentation for Windows, as the rest of this page refers to usage in Linux-type environments.

3. Options for the Solr Cell ExtractingRequestHandler

The ExtractingRequestHandler accepts the 19 options (parameters) listed below, which can be passed in requests made with curl or bin/post. Required fields include the core or collection name and the location of the files to post.

Option (parameter) | Description | Default
capture | While document text will be captured in a field called "content", this parameter allows you to additionally capture a specific section of the document, like the text within <p> or <div> tags in HTML. | none
captureAttr | If captureAttr=true, attributes within HTML tags are captured in fields named after the tag. For example, a link using an <a href="filename.html"> tag could pull out the attribute href="filename.html". | false
commitWithin | Commits the added document to the index within the specified number of milliseconds. | none
date.formats | Sets the date formats used to recognize dates in documents. | none
defaultField | If the uprefix parameter is not set and a field name cannot be determined, the field named here is used. For example, defaultField=text. | none
extractOnly | Returns the content extracted by Tika without sending the document to the index. | false
extractFormat | If extractOnly=true, this specifies whether the extracted content is returned in xml or text format. | xml
fmap.source_field | Maps a field named in the incoming document to a field with a different name in Solr. So fmap.title=about would map a field named title by Tika to one named about in Solr. | none
ignoreTikaException | If ignoreTikaException=true, exceptions raised during parsing are skipped and any available metadata is still indexed. | false
literal.fieldname | Fills in a field with a literal value. For example, with literal.about=empty the about field for each document will be set to "empty". | none
literalsOverride | If literalsOverride=true, a literal.fieldname value overrides any value provided by Tika for the same field. If literalsOverride=false, the literal value is appended to the values from Tika as another entry, so the field must be set to multiValued=true. | true
lowernames | If lowernames=true, all field names are changed to lowercase with underscores. So a field named Content-Encoding would be changed to content_encoding. | false
multipartUploadLimitInKB | Limits the size, in KB, of documents allowed to be uploaded. | 2048
passwordsFile | Points to a file that maps file names to passwords. | none
resource.name | Specifies the name of the file, which Tika can use as a hint when detecting the file's MIME type. | none
resource.password | Submits a password if the file has its own password. | none
tika.config | Points to a customized Tika configuration file for more advanced implementations. | none
uprefix | For fields that are not already in the Solr schema, this parameter prepends text to the field name from Tika. This can be used with a dynamic field definition on the Solr end to create fields which Solr will then ignore. For example, using uprefix=ignored_ turns the field content_type into a field named ignored_content_type. Then on the Solr end each field that is prefixed with ignored_ would not be indexed. This assumes that there is a dynamicField in the schema.xml file with a line specified like this: <dynamicField name="ignored_*" type="ignored" /> | none
xpath | Restricts extraction to the sections of Tika's XHTML output that match the given XPath expression. | none
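
Several of these parameters are often combined in one request. The sketch below builds such a parameter string step by step. The core name example, the document id doc1, the 5000 millisecond window and the file docs/test.html are all assumptions for illustration, as is the presence of an id field and an ignored_* dynamic field in the schema.

```shell
#!/bin/sh
# All values below are illustrative assumptions, not required settings.
params="literal.id=doc1"                # set the id field to a fixed value
params="${params}&uprefix=ignored_"     # prefix Tika fields missing from the schema
params="${params}&fmap.content=_text_"  # send Tika's content field to _text_
params="${params}&commitWithin=5000"    # commit within 5000 milliseconds

url="http://localhost:8983/solr/example/update/extract?${params}"
printf '%s\n' "$url"

# Post the file for indexing (needs a running Solr server):
# curl "$url" --data-binary @docs/test.html -H "Content-type:text/html"
```

Adding extractOnly=true to the same string is a way to check the field names Tika produces before anything is written to the index.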

4. Examples of Solr Cell to the Apache Tika Parser

For these examples, assume we have an example HTML file saved as docs/test.html from the Solr installation directory.

Example 1 - View a parsed HTML file using curl

The following command parses a file called test.html and summarizes it at the terminal using a core named example without posting it to the index.

$ curl "http://localhost:8983/solr/example/update/extract?extractOnly=true" --data-binary @docs/test.html -H "Content-type:text/html"

Example 2 - View a parsed HTML file using bin/post

The following command allows you to pass parameters to bin/post viewing the extraction with output to the terminal for the core named example.

$ bin/post -c example -params "extractOnly=true" -out yes docs/test.html
