FactorPad
Faster Learning Tutorials

Solr Tokenizers : syntax, options and examples

A tokenizer in Solr breaks text streams into tokens and passes them on to the filter for additional text analytics.
  1. About - Understand the purpose of a tokenizer.
  2. Syntax - See how tokenizers are coded in schema.xml or managed-schema.
  3. Options - View different classes with descriptions and use cases.
  4. Examples - Review examples of commonly-used tokenizers.
by Paul Alan Davis, CFA, November 12, 2017
Updated: July 16, 2018
The tokenizer is the second step of the analyzer chain. Let's see how it works.


Using a Solr Tokenizer for Text Analytics

Beginner

Using a computer to break up and analyze word patterns is a big part of text analytics and natural language processing. These topics have taken on greater importance over time as more and more companies look for ways to analyze big data.

One such application is custom search, and tools like Apache Solr and Elasticsearch have become more prominent as the rules for breaking text into smaller "tokens" have expanded beyond search and into the field of text analytics. Both tools are built on the Apache Lucene libraries, which supply the rules and algorithms for text analytics.

Here we evaluate one step in Solr's text analytics process, called tokenizing. Our project is a custom search or enterprise search application, set up for beginners in a test environment. Many beginners accept the default tokenizers at this stage and explore customization more fully when moving to a production environment.

Apache Solr Reference

1. About Field Type Tokenizers

For background, each field in Solr is assigned a fieldType and each fieldType processes text using an analyzer. The analyzer can be established in the schema in one of two ways. First, it may be one class that includes a tokenizer and one or several filters behind the scenes. Second, each tokenizer and filter may be spelled out explicitly. The former is for ease of use, and the latter allows for greater customization.
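
As a rough sketch of those two approaches, the fragment below defines one fieldType each way. The fieldType names text_ws_class and text_ws_explicit are hypothetical, chosen only for illustration. The two definitions are not meant to be equivalent: the second adds a lowercase filter to show how an explicit chain can be extended.

```xml
<!-- Approach 1: a single analyzer class; its tokenizer and any filters are fixed inside the class -->
<fieldType name="text_ws_class" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
</fieldType>

<!-- Approach 2: the tokenizer and each filter are spelled out explicitly -->
<fieldType name="text_ws_explicit" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The explicit form is longer, but each step can be swapped, removed or reordered on its own.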

The tokenizer processes streams of text, breaks up the words into individual tokens and then passes them to the filter. The whole analysis process is used in two places, first when documents are indexed and second when the user submits a search query.
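
Because analysis runs at both index time and query time, a fieldType may declare a separate analyzer for each with the type attribute. The sketch below loosely follows the layout of Solr's stock text_general definition. The fieldType name text_split is hypothetical, and the fragment assumes that stopwords.txt and synonyms.txt exist in the configuration directory.

```xml
<fieldType name="text_split" class="solr.TextField" positionIncrementGap="100">
  <!-- Applied when documents are indexed -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Applied to the search query submitted by the user -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

When a single analyzer is declared without a type, it is used in both places. Keeping the same tokenizer on both sides is usually the safer choice, so that query tokens line up with indexed tokens.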

The tokenizer tag in XML is a child of an analyzer tag, and it points to a class name that may require additional arguments. The block of analyzer settings sits in one of two XML files, called schema.xml or managed-schema.
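
To illustrate a class that accepts an additional argument, the sketch below passes maxTokenLength to the Standard Tokenizer. The value 100 is arbitrary and chosen only for illustration (the default is 255), and the fieldType name text_short_tokens is hypothetical.

```xml
<fieldType name="text_short_tokens" class="solr.TextField">
  <analyzer>
    <!-- Arguments are passed as XML attributes on the tokenizer tag -->
    <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="100"/>
  </analyzer>
</fieldType>
```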

2. Syntax for Tokenizers

Below is an example of a multi-line analyzer, including a tokenizer and filters.

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The Standard Tokenizer, identified here by the StandardTokenizerFactory class, splits text at whitespace and punctuation and also splits hyphenated words into two tokens. Note that the tokenizer tag has no child elements, so it is written as a single self-closing tag ending with />.

3. Options for Tokenizers and Their Actions

The following is a list of common tokenizers by name and class name. Use this table in conjunction with analyzers and filters.

Name and Class Tokenizer Actions
Classic Tokenizer
ClassicTokenizerFactory
  • Keeps domain names and email addresses as single tokens
  • Words with hyphens and only letters are split into two tokens
  • Words with hyphens and numbers are retained as one token
  • Periods that are not followed by whitespace are kept
Keyword Tokenizer
KeywordTokenizerFactory
  • Keeps the entire text field as one token
  • Well suited for id fields and structured data
Letter Tokenizer
LetterTokenizerFactory
  • Emits contiguous runs of letters as tokens
  • All other characters are discarded
Lowercase Tokenizer
LowerCaseTokenizerFactory
  • Emits contiguous runs of letters as tokens
  • All other characters are discarded
  • Converts all uppercase characters to lowercase
Standard Tokenizer
StandardTokenizerFactory
  • Treats whitespace and punctuation (not numbers) as delimiters
  • Periods that are not followed by whitespace are retained
  • Domain names are preserved but email addresses are not
  • Words with hyphens are split into two tokens
UAX29 URL Email Tokenizer
UAX29URLEmailTokenizerFactory
  • Treats whitespace and punctuation (not numbers) as delimiters
  • Periods that are not followed by whitespace are retained
  • Words with hyphens are split into two tokens, whether they contain only letters or both letters and numbers
  • Preserves internet domain names, email addresses, full file://, http(s):// and ftp:// addresses, and IPv4 and IPv6 addresses
White Space Tokenizer
WhitespaceTokenizerFactory
  • Uses whitespace only as delimiters
  • Returns everything else as-is, including punctuation

There are many other tokenizers, including several that create tokens with regular expressions, that expand file paths, or that tokenize based simply on the number of characters.
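
As a sketch of what three of these other tokenizers look like in the schema, the fragment below shows one tokenizer tag for each kind mentioned above. The fieldType names and attribute values are hypothetical choices for illustration only.

```xml
<!-- Regular expression: split on commas, ignoring whitespace around them -->
<fieldType name="text_csv" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
  </analyzer>
</fieldType>

<!-- File path expansion: one token for each level of the path hierarchy -->
<fieldType name="text_path" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
</fieldType>

<!-- Character count: tokens of 2 to 3 characters each (n-grams) -->
<fieldType name="text_ngram" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="3"/>
  </analyzer>
</fieldType>
```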

4. Examples of Tokenizers

Example 1 - Classic Tokenizer

The ClassicTokenizerFactory class is effective for Internet domains and email addresses. Because hyphenated words that contain numbers stay intact, it is also well suited for product names.

Input: E-mail joe@example.com regarding HP-60 ink
Output: "E", "mail", "joe@example.com", "regarding", "HP-60", "ink"

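
A minimal sketch of a fieldType using this tokenizer follows. The fieldType name text_classic is hypothetical. The ClassicFilterFactory is an optional companion that removes periods from acronyms and 's from possessives, and it is included here as an assumption about what a typical chain might contain.

```xml
<fieldType name="text_classic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <!-- Optional cleanup designed to follow the Classic Tokenizer -->
    <filter class="solr.ClassicFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```
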
Example 2 - Keyword Tokenizer

The KeywordTokenizerFactory class keeps the whole field intact and may be best suited for product names, IDs, structured data and keywords that need to be left intact.

Input: New York
Output: "New York"

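
One possible use, sketched below, pairs the Keyword Tokenizer with a lowercase filter to get exact matching that ignores case. The fieldType name text_exact is hypothetical. With this chain, the input New York would be indexed as the single token "new york".

```xml
<fieldType name="text_exact" class="solr.TextField">
  <analyzer>
    <!-- The entire field value becomes one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Lowercase the token so that matching is case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```
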
Example 3 - Letter Tokenizer

The LetterTokenizerFactory uses all non-letters as delimiters, so further work in the filter stage is helpful.

Input: I sent 32 E-mails and didn't get 1 RESPONSE!!
Output: "I", "sent", "E", "mails", "and", "didn", "t", "get", "RESPONSE"

Example 4 - Lowercase Tokenizer

The LowerCaseTokenizerFactory uses all non-letters as delimiters, as the Letter Tokenizer does, and lowercases text at the same time.

Input: I sent 32 E-mails and didn't get 1 RESPONSE!!
Output: "i", "sent", "e", "mails", "and", "didn", "t", "get", "response"

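
The same output can also be produced by chaining the Letter Tokenizer with a lowercase filter, as sketched below. This form may be the more portable choice: to the best of our knowledge, Solr 8.0 and later no longer include LowerCaseTokenizerFactory, so please verify against the version in use. The fieldType name text_letters_lower is hypothetical.

```xml
<fieldType name="text_letters_lower" class="solr.TextField">
  <analyzer>
    <!-- Split at every non-letter character -->
    <tokenizer class="solr.LetterTokenizerFactory"/>
    <!-- Then lowercase each token in the filter stage -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```
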
Example 5 - Standard Tokenizer

The StandardTokenizerFactory works well with domains, but not email addresses, and also preserves numbers.

Input: Address: 123 Burns St. Weed, CA 96094-1234. E-mail: joe@example.com Web: example.com
Output: "Address", "123", "Burns", "St", "Weed", "CA", "96094", "1234", "E", "mail", "joe", "example.com", "Web", "example.com"

Example 6 - UAX29 URL Email Tokenizer

The UAX29URLEmailTokenizerFactory is well suited for Internet-related information.

Input: Have you seen http://example.com yet? It has 10,000 web-pages.
Output: "Have", "you", "seen", "http://example.com", "yet", "It", "has", "10,000", "web", "pages"

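
A minimal sketch of a fieldType built around this tokenizer follows. The fieldType name text_url_email is hypothetical, and the lowercase filter is an optional addition so that addresses match regardless of case.

```xml
<fieldType name="text_url_email" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- URLs and email addresses are kept as single tokens -->
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```
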
Example 7 - White Space Tokenizer

The WhitespaceTokenizerFactory class is a highly simplified tokenizer that works best with plain text, as opposed to structured data with punctuation.

Input: This-tokenizer-only-delimits-at-white-spaces. Wow!
Output: "This-tokenizer-only-delimits-at-white-spaces.", "Wow!"

As you can see, natural language processing and text analytics can get very confusing. With practice and a focus on the incoming fields, the developer can fine-tune the search application. At the beginning, in a test environment, many developers need to accept that a search tool is "good enough" and postpone improvements until after gathering feedback.

