Solr Tokenizers Syntax and Examples | Lucene and Solr Reference

Using a Solr Tokenizer for Text Analytics

Beginner

Using a computer to break up and analyze word patterns is a big part of text analytics and natural language processing. These topics have grown in importance as more and more companies look for ways to analyze big data.

One such application is custom search. Tools like Apache Solr and Elasticsearch have become more prominent as the rules for breaking text into smaller "tokens" have expanded beyond search and into the broader field of text analytics. Both tools are built on the Apache Lucene libraries, which supply the rules and algorithms for text analysis.

Here we evaluate one step in the text analytics process within Solr, called tokenizing. Our project is a custom search or enterprise search application, aimed at beginners working in a test environment. Many beginners accept the default tokenizers at this stage and explore customization more fully when moving to a production environment.


1. About Field Type Tokenizers

For background, each field in Solr is assigned a fieldType and each fieldType processes text using an analyzer. The analyzer can be established in the schema in one of two ways. First, it may be one class that includes a tokenizer and one or several filters behind the scenes. Second, each tokenizer and filter may be spelled out explicitly. The former is for ease of use, and the latter allows for greater customization.
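The two approaches might look like the following sketch. The field type names text_simple and text_custom are made up for illustration; the classes shown ship with Lucene and Solr.

```xml
<!-- Option 1: a single analyzer class that handles tokenizing
     (and any filtering) internally -->
<fieldType name="text_simple" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
</fieldType>

<!-- Option 2: the tokenizer and each filter spelled out explicitly -->
<fieldType name="text_custom" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With the second form, swapping the tokenizer or adding a filter is a one-line change, which is where the extra customization comes from.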

The tokenizer processes a stream of text, breaks it into individual tokens and then passes them on to any filters. The whole analysis process is used in two places: first when documents are indexed, and second when the user submits a search query.
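When indexing and querying need different treatment, a field type can declare one analyzer for each. The sketch below is illustrative only: the name text_split and the synonyms.txt file are assumptions, not part of this project.

```xml
<fieldType name="text_split" class="solr.TextField">
  <!-- Applied to document text when it is indexed -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Applied to the user's search terms at query time -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Using the same tokenizer on both sides is the usual choice, since query tokens can only match index tokens that were split the same way.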

The tokenizer tag in XML is a child of an analyzer tag, and it points to a class name that may require additional arguments. The block of analyzer settings sits in one of two XML files, called schema.xml or managed-schema.
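As a small sketch of a tokenizer tag with an argument, the Standard Tokenizer accepts an optional maxTokenLength attribute. The value 255 here is only an example setting.

```xml
<analyzer>
  <!-- Arguments are passed as XML attributes on the tokenizer tag -->
  <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="255"/>
</analyzer>
```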

2. Syntax for Tokenizers

Below is an example of a multi-line analyzer, including a tokenizer and filters.
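A minimal sketch of such an analyzer block, assuming a field type named text_general and a stopwords.txt file in the same config directory (both are illustrative choices):

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- The tokenizer comes first and is a single self-closing tag -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Filters run afterwards, in the order listed -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```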

The Standard Tokenizer, identified in this case by the StandardTokenizerFactory class, splits text at whitespace and punctuation, and also splits hyphenated words into two tokens. Note that the single-line tokenizer tag is self-closing and ends with />.

3. Options for Tokenizers and Their Actions

The following is a list of common tokenizers by name and class name. Use this table in conjunction with analyzers and filters.

Name and Class	Tokenizer Actions
Classic Tokenizer `ClassicTokenizerFactory`	Keeps domain names and email addresses as single tokens; splits hyphenated words that contain only letters into two tokens; keeps hyphenated words that contain numbers as one token; keeps periods that are not followed by whitespace
Keyword Tokenizer `KeywordTokenizerFactory`	Keeps the entire text field as one token; well suited for ID fields and structured data
Letter Tokenizer `LetterTokenizerFactory`	Keeps runs of contiguous letters as tokens; discards all other characters
Lowercase Tokenizer `LowerCaseTokenizerFactory`	Keeps runs of contiguous letters as tokens; discards all other characters; converts all uppercase characters to lowercase
Standard Tokenizer `StandardTokenizerFactory`	Treats whitespace and punctuation (not numbers) as delimiters; keeps periods that are not followed by whitespace; preserves domain names but not email addresses; splits hyphenated words into two tokens
UAX29 URL Email Tokenizer `UAX29URLEmailTokenizerFactory`	Treats whitespace and punctuation (not numbers) as delimiters; keeps periods that are not followed by whitespace; splits hyphenated words into two tokens, whether they contain only letters or letters and numbers; preserves internet domain names, email addresses, full file://, http(s)://, and ftp:// addresses, and IPv4 and IPv6 addresses
White Space Tokenizer `WhitespaceTokenizerFactory`	Uses whitespace only as a delimiter; returns everything else as-is, including punctuation

There are many other tokenizers, including several that create tokens with regular expressions, expand file paths, or split text simply by the number of characters.
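For a rough idea of what those look like, here is one tokenizer tag of each kind. The attribute values are arbitrary examples, and an analyzer uses only one tokenizer, so these are alternatives rather than a single working block.

```xml
<!-- Regular expression: split on commas with optional surrounding whitespace -->
<tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>

<!-- File path expansion: emit a token for each level of a file path -->
<tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>

<!-- Character count: emit every 2- to 3-character slice of the text -->
<tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="3"/>
```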

4. Examples of Tokenizers

Example 1 - Classic Tokenizer

The ClassicTokenizerFactory class is effective for Internet domains and email addresses. Its handling of hyphens suits product names, because a hyphenated term that contains a number stays together as one token.

Input	Output
E-mail joe@example.com regarding HP-60 ink	"E", "mail", "joe@example.com", "regarding", "HP-60", "ink"

Example 2 - Keyword Tokenizer

The KeywordTokenizerFactory class keeps the whole field intact and may be best suited for product names, IDs, structured data and keywords that need to be left intact.

Input	Output
New York	"New York"
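One common way to use this tokenizer is for case-insensitive exact matching. In the sketch below, the field type name exact_lower and the field name city are hypothetical.

```xml
<fieldType name="exact_lower" class="solr.TextField">
  <analyzer>
    <!-- The whole field value becomes a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Lowercasing lets "new york" match "New York" -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="city" type="exact_lower" indexed="true" stored="true"/>
```

With this setup the input New York is indexed as the single token "new york".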

Example 3 - Letter Tokenizer

The LetterTokenizerFactory uses all non-letters as delimiters, so further work in the filter stage is helpful.

Input	Output
I sent 32 E-mails and didn't get 1 RESPONSE!!	"I", "sent", "E", "mails", "and", "didn", "t", "get", "RESPONSE"
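As a sketch of that follow-up filter work, the Letter Tokenizer can be paired with lowercase and stop word filters. The name text_letters and the stopwords.txt file are assumptions for illustration.

```xml
<fieldType name="text_letters" class="solr.TextField">
  <analyzer>
    <!-- Splits at every non-letter, so digits and punctuation are dropped -->
    <tokenizer class="solr.LetterTokenizerFactory"/>
    <!-- Normalizes case so "RESPONSE" and "response" match -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Removes common words listed in the stop word file -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```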

Example 4 - Lowercase Tokenizer

The LowerCaseTokenizerFactory uses all non-letters as delimiters and lowercases the text at the same time.

Input	Output
I sent 32 E-mails and didn't get 1 RESPONSE!!	"i", "sent", "e", "mails", "and", "didn", "t", "get", "response"

Example 5 - Standard Tokenizer

The StandardTokenizerFactory works well with domains, but not email addresses, and also preserves numbers.

Input	Output
Address: 123 Burns St. Weed, CA 96094-1234. E-mail: joe@example.com Web: example.com	"Address", "123", "Burns", "St", "Weed", "CA", "96094", "1234", "E", "mail", "joe", "example.com", "Web", "example.com"

Example 6 - UAX29 URL Email Tokenizer

The UAX29URLEmailTokenizerFactory is well suited for Internet-related information.

Input	Output
Have you seen http://example.com yet? It has 10,000 web-pages.	"Have", "you", "seen", "http://example.com", "yet", "It", "has", "10,000", "web", "pages"
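A field that holds contact details or web content could be configured along these lines. This is only a sketch, and the field type name text_web is made up.

```xml
<fieldType name="text_web" class="solr.TextField">
  <analyzer>
    <!-- Keeps URLs and email addresses together as single tokens -->
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```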

Example 7 - White Space Tokenizer

The WhitespaceTokenizerFactory class is a highly simplified tokenizer that works best with plain text, as opposed to structured data with punctuation.

Input	Output
This-tokenizer-only-delimits-at-white-spaces. Wow!	"This-tokenizer-only-delimits-at-white-spaces.", "Wow!"

As you can see, tokenizers differ in subtle ways, and the text analytics process can get confusing. With practice and a focus on the incoming fields, the developer can fine-tune the search application. In a test environment, many developers start by accepting a search tool that is "good enough" and postpone improvements until after gathering feedback.
