/ factorpad.com / tech / solr / reference / solr-tokenizers.html
Beginner
Using a computer to break up and analyze word patterns is a big part of text analytics and natural language processing. These topics have grown in importance as more and more companies look for ways to analyze big data.
One such application is custom search. Tools like Apache Solr and Elasticsearch have become more prominent as the rules for breaking text into smaller "tokens" have expanded beyond search and into the field of text analytics. Both tools are built on the Apache Lucene libraries, which supply the rules and algorithms for text analysis.
Here we evaluate one step in Solr's text analysis process, called tokenizing. The project is a custom search or enterprise search application, and we start as beginners would, in a test environment. Many beginners accept the default tokenizers at this stage and explore customization more fully when moving to a production environment.
For background, each field in Solr is assigned a fieldType and each fieldType processes text using an analyzer. The analyzer can be established in the schema in one of two ways. First, it may be one class that includes a tokenizer and one or several filters behind the scenes. Second, each tokenizer and filter may be spelled out explicitly. The former is for ease of use, and the latter allows for greater customization.
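The two ways of declaring an analyzer can be sketched as follows. This is a minimal illustration only: the fieldType names `text_ws_simple` and `text_ws_custom` are hypothetical, and the choice of a whitespace analyzer and a lowercase filter is an assumption made for the example. Both fragments would sit inside the schema's root element.

```xml
<!-- 1. One analyzer class: the tokenizer and any filters are bundled behind the scenes -->
<fieldType name="text_ws_simple" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
</fieldType>

<!-- 2. Explicit chain: the tokenizer and each filter are spelled out -->
<fieldType name="text_ws_custom" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The second form is longer, but it is the one that lets you swap the tokenizer or add filters one at a time.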
The tokenizer processes streams of text, breaks up the words into individual tokens and then passes them to the filter. The whole analysis process is used in two places, first when documents are indexed and second when the user submits a search query.
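Because analysis runs both at index time and at query time, Solr lets a fieldType declare a separate analyzer for each. The sketch below assumes a hypothetical fieldType named `text_two_phase` and a `synonyms.txt` file in the configuration directory; the specific filters are illustrative choices, not a recommendation.

```xml
<fieldType name="text_two_phase" class="solr.TextField">
  <!-- Applied when documents are indexed -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Applied when the user submits a search query -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```

Using the same tokenizer on both sides is usually the safer starting point, since query tokens only match if they look like the tokens stored in the index.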
The tokenizer tag in XML is a child of an analyzer tag and it points to a class name, which may take additional arguments. The block of analyzer settings sits in one of two XML files, called schema.xml or managed-schema.
Below is an example of a multi-line analyzer, including a tokenizer and filters.
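As a minimal sketch, such an analyzer could be written as shown here. The fieldType name `text_example`, the two filters and the `stopwords.txt` file are assumptions chosen for illustration, not settings taken from a particular schema.

```xml
<fieldType name="text_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Exactly one tokenizer, written as a self-closing tag -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Zero or more filters, applied in the order listed -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```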
The Standard Tokenizer, identified by the StandardTokenizerFactory class in this case, breaks up text at whitespace and punctuation delimiters, and also breaks hyphenated words into two tokens. Note that the single-line tokenizer tag ends with />.
The following is a list of common tokenizers by name and class name. Use this table in conjunction with analyzers and filters.
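Name and Class | Tokenizer Actions |
---|---|
Classic Tokenizer (ClassicTokenizerFactory) | Keeps Internet domains and email addresses intact; splits hyphenated words unless they contain a number |
Keyword Tokenizer (KeywordTokenizerFactory) | Keeps the whole field as a single token |
Letter Tokenizer (LetterTokenizerFactory) | Treats every non-letter as a delimiter |
Lowercase Tokenizer (LowerCaseTokenizerFactory) | Treats every non-letter as a delimiter and lowercases each token |
Standard Tokenizer (StandardTokenizerFactory) | Splits at whitespace, punctuation and hyphens; keeps domains and numbers, but splits email addresses |
UAX29 URL Email Tokenizer (UAX29URLEmailTokenizerFactory) | Works like the Standard Tokenizer, but keeps URLs and email addresses intact |
White Space Tokenizer (WhitespaceTokenizerFactory) | Splits only at whitespace; punctuation stays attached to the token |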
There are many other tokenizers, including ones that create tokens with regular expressions, ones that expand file paths and ones that split text simply by the number of characters.
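For a rough idea of what those other tokenizers look like in a schema, here is a sketch of three of them. The fieldType names are hypothetical, and the attribute values (the comma pattern, the `/` delimiter, the gram sizes) are assumptions picked only to make the examples concrete.

```xml
<!-- Regular expression: split on commas with optional surrounding whitespace -->
<fieldType name="text_csv" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
  </analyzer>
</fieldType>

<!-- File path expansion: emits each leading portion of a path as its own token -->
<fieldType name="text_path" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
</fieldType>

<!-- Character counts: emits every 2- and 3-character substring -->
<fieldType name="text_ngram" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="3"/>
  </analyzer>
</fieldType>
```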
The ClassicTokenizerFactory class is effective for Internet domains and email addresses. It splits hyphenated words unless they contain a number, a treatment best suited for product names.
Input | Output |
---|---|
E-mail joe@example.com regarding HP-60 ink | "E", "mail", "joe@example.com", "regarding", "HP-60", "ink" |
The KeywordTokenizerFactory class keeps the whole field intact and may be best suited for product names, IDs, structured data and keywords that need to be left intact.
Input | Output |
---|---|
New York | "New York" |
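A field type built on this tokenizer might be sketched as below. The name `keyword_exact` is hypothetical, and the lowercase filter is an assumption added so that matching ignores case; leave it out if exact case matters.

```xml
<fieldType name="keyword_exact" class="solr.TextField">
  <analyzer>
    <!-- The entire field value becomes one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Optional: lowercase that single token for case-insensitive matching -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```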
The LetterTokenizerFactory uses all non-letters as delimiters, so further work in the filter stage is helpful.
Input | Output |
---|---|
I sent 32 E-mails and didn't get 1 RESPONSE!! | "I", "sent", "E", "mails", "and", "didn", "t", "get", "RESPONSE" |
The LowerCaseTokenizerFactory uses all non-letters as delimiters and lowercases text at the same time.
Input | Output |
---|---|
I sent 32 E-mails and didn't get 1 RESPONSE!! | "i", "sent", "e", "mails", "and", "didn", "t", "get", "response" |
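The same output can also be produced by pairing the Letter Tokenizer with a lowercase filter, which keeps the two steps separate and easier to adjust. In this sketch the fieldType name `text_letters_lower` is hypothetical.

```xml
<fieldType name="text_letters_lower" class="solr.TextField">
  <analyzer>
    <!-- Split on every non-letter character -->
    <tokenizer class="solr.LetterTokenizerFactory"/>
    <!-- Then lowercase each token -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```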
The StandardTokenizerFactory works well with domains, but not email addresses, and also preserves numbers.
Input | Output |
---|---|
Address: 123 Burns St. Weed, CA 96094-1234. E-mail: joe@example.com Web: example.com | "Address", "123", "Burns", "St", "Weed", "CA", "96094", "1234", "E", "mail", "joe", "example.com", "Web", "example.com" |
The UAX29URLEmailTokenizerFactory is well suited for Internet-related information.
Input | Output |
---|---|
Have you seen http://example.com yet? It has 10,000 web-pages. | "Have", "you", "seen", "http://example.com", "yet", "It", "has", "10,000", "web", "pages" |
The WhitespaceTokenizerFactory class is a highly simplified tokenizer that works best with text as opposed to structured data with punctuation.
Input | Output |
---|---|
This-tokenizer-only-delimits-at-white-spaces. Wow! | "This-tokenizer-only-delimits-at-white-spaces.", "Wow!" |
As you can see, natural language processing and text analytics can get confusing. With practice and attention to the incoming fields, a developer can fine-tune the search application. Early on, in a test environment, many developers accept that a search tool is "good enough" and postpone improvements until after gathering feedback.