The following reference is intended for developers evaluating Apache Solr for enterprise search or website search applications. Both Apache Solr and Elasticsearch use Lucene libraries for custom search.
The Solr schema is a configuration file that tells Solr how to index documents and which Fields to return in search results. Documents may contain structured data, as you might find in the database behind an online store, or unstructured data, as used in full text search applications like search engines.
The Solr schema is stored in a file named managed-schema when the user elects to make modifications using the Solr Schema API, or schema.xml for more advanced users who modify the schema by hand.
Fields in Solr describe the pieces of information held in each document and the information being searched for. Each Field is assigned a Field Type, which provides rules for how Fields of that type should be processed during indexing and search.
The version="1.6" attribute in the schema dictates default values for each Field Type class, with 1.6 being the schema version for Solr version 7. These may be overridden at the Field level.
The easiest way to think about defaults is that each Field Type class sets them. The defaults are listed in the tables below, and they can be overridden at the Field Type level or at the Field level.
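The two override levels can be sketched as a schema fragment. This is a minimal illustration, not a listing from any shipped configset: the sku Field is a hypothetical name, and the fragment is assumed to sit inside the schema element.

```xml
<!-- Field Type level: every Field of type "string" gets docValues,
     overriding the class default of false -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true"/>

<!-- Field level: this one Field (hypothetical name) turns docValues back off -->
<field name="sku" type="string" indexed="true" stored="true" docValues="false"/>
```

A setting on the Field wins over the Field Type, which in turn wins over the class default.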
An example for both Field Types and Fields might look like this (including the XML and schema tags).
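A minimal sketch of such a snippet follows. The analyzer chain (a standard tokenizer and a lowercase filter) and the schema name are assumptions chosen for illustration; the text_general definition in a real configset is usually longer.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<schema name="example" version="1.6">

  <!-- Field Type: rules for processing text at index and query time -->
  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- Field: the title of each document uses the Field Type above -->
  <field name="title" type="text_general" indexed="true" stored="true"/>

</schema>
```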
The schema file is typically hundreds of lines long. A snippet of this kind first defines a Field Type that processes all Fields declared with type="text_general". In this case, the title Field pulled from the indexed document is assigned this Field Type.
Where you see indexed="true" and stored="true" in the Field tag, these are examples of Field properties. The indexed property dictates whether the Field's content is added to the index so it can be searched, and the stored property dictates whether the original value is kept so it can be returned in search results.
The tables below list and describe 19 properties that can be included in the Field tag and will override defaults for that Field Type.
Below are 19 Field properties provided by Solr with defaults.
The first table represents the most commonly used properties for beginners. All properties are entered as either true or false, except the name, type and default definitions, which take a text value.
The 8 common Field properties relate to whether Fields are indexed, stored and retrievable during search and, similar to a database, whether they are required and can have multiple values.
Remember, part of the goal is to minimize the size of an index, and these settings allow you to customize your index and turn on the features you need.
|Property|Description|Default|
|---|---|---|
|name|The name for the field.|--|
|type|Points to a FieldType within the same schema that controls behaviors for all fields of that type.|--|
|default|The field will be populated with this value (default="value") if no data is supplied at index time.|none|
|indexed|Only when true can the Field be searched or sorted in queries to retrieve matching documents.|true|
|stored|Only when true can the Field's value be retrieved in queries.|true|
|required|When true Solr will not add documents to the index where a value in this Field is missing. This is common for id Fields and structured data.|false|
|multiValued|When true a document may have multiple values of this Field or Field Type. Similar to a one-to-many relationship in a database.|false|
|docValues|When true the value in a Field is added to an additional structure called DocValues that is helpful for retrieving information used to sort, highlight terms or provide facets (groupings). A standard inverted index is not ideally suited to this type of operation, so DocValues adds a column-oriented structure to the index. This adds to the size and complexity of the index, so if you are not sorting, highlighting or faceting, the setting should be false. docValues are only available for some Field Types.|false|
By convention, the name should start with a letter. Names with both leading and trailing underscores are reserved, for example _version_, _text_ and _root_, which together with id are the four pre-declared fields in the _default configset.
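The common properties in the table above can be combined as in the following sketch. All Field names here are hypothetical, and the string Field Type is assumed to be defined elsewhere in the same schema.

```xml
<!-- Several values per document, like a one-to-many relationship -->
<field name="category" type="string" indexed="true" stored="true" multiValued="true"/>

<!-- Filled with "unknown" when a document supplies no value -->
<field name="status" type="string" indexed="true" stored="true" default="unknown"/>

<!-- Kept for display only: retrievable but not searchable -->
<field name="thumbnail_url" type="string" indexed="false" stored="true"/>

<!-- Documents without this value are rejected at index time -->
<field name="product_code" type="string" indexed="true" stored="true" required="true"/>
```

Turning off indexed or stored where a feature is not needed is one way to keep the index small.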
The following table of 11 properties relates to finer points of index construction and will impact the size of the index and its ability to find and rank documents during search.
|Property|Description|Default|
|---|---|---|
|sortMissingFirst|When documents are sorted on a Field and this is true, documents with no value in that Field show up first. This works for string, boolean, date and numeric data types only.|false|
|sortMissingLast|When documents are sorted on a Field and this is true, documents with no value in that Field show up last. This works for string, boolean, date and numeric data types only.|false|
|omitNorms|When true it disables length normalization for text Fields. Defaults to true for non-analyzed Field Types such as BinaryField, BoolField, IntPointField and StrField, and false for text fields.|true|
|omitTermFreqAndPositions|When true it omits the term frequency, position and payload information that tokenized text fields otherwise carry and that is used in document ranking. It defaults to true for non-text fields and false for text fields.|true|
|omitPositions|When true it omits the position information from tokens but keeps term frequency.|true|
|termVectors|Maintains term vectors for each document, helpful for MoreLikeThis where document similarity is required.|false|
|termPositions|Maintains position information for tokens in the term vectors.|false|
|termOffsets|Maintains offset information for tokens in the term vectors, for advanced Field parsing.|false|
|termPayloads|Maintains payload information for tokens in the term vectors, which can be used in document scoring.|false|
|useDocValuesAsStored|If the Field has docValues enabled, setting this to true allows the Field to be returned as if it were stored, even with stored="false", when "*" is used in the fl search parameter.|true|
|large|Requires stored="true" and multiValued="false". When true, large Field values are loaded lazily and kept out of the document cache unless they are small, thus improving performance.|false|
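A few of these finer-grained properties are sketched below. The Field names are hypothetical, and the pfloat and text_general Field Types are assumed to be defined elsewhere in the schema, as they are in the _default configset.

```xml
<!-- Documents with no price sort to the end of the results -->
<field name="price" type="pfloat" indexed="true" stored="false"
       docValues="true" sortMissingLast="true"/>

<!-- Term vectors with positions and offsets kept for each document -->
<field name="body" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

<!-- Large stored value: must be stored and single-valued -->
<field name="raw_content" type="text_general" indexed="true" stored="true"
       multiValued="false" large="true"/>
```

Each of the term vector options increases index size, so they are best enabled only on the Fields that need them.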
In this case, a Field is given two required properties that make it suitable as a unique key.
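A minimal sketch of a unique-key Field, assuming the two properties in question are indexed="true" and multiValued="false". The required="true" and stored="true" settings and the uniqueKey element are additions modeled on the id Field of the _default configset.

```xml
<!-- One value per document, searchable, and mandatory -->
<field name="id" type="string" indexed="true" stored="true"
       required="true" multiValued="false"/>

<!-- Tells Solr which Field identifies each document -->
<uniqueKey>id</uniqueKey>
```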
In this case, a Field is included in the index, and searches can be performed within the Field.
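As a sketch, a searchable Field needs indexed="true". The description name is hypothetical, and stored="false" is an assumption added to show that searching does not depend on storing the value.

```xml
<!-- Searchable, but the original text is not kept for retrieval -->
<field name="description" type="text_general" indexed="true" stored="false"/>
```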
In this case, we set up a Field that can be returned in queries.
Alternatively, if you are using docValues you could use docValues="true".
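Both options can be sketched as follows. The Field names are hypothetical, and the pint Field Type is assumed to be defined elsewhere in the schema, as it is in the _default configset.

```xml
<!-- Option 1: the value is stored and returned in results -->
<field name="summary" type="string" indexed="false" stored="true"/>

<!-- Option 2: the value is returned from docValues instead of stored fields -->
<field name="popularity" type="pint" indexed="false" stored="false"
       docValues="true" useDocValuesAsStored="true"/>
```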
In the following case a Field can be used to sort documents.
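A minimal sketch of sortable Fields, assuming hypothetical Field names and the pfloat and string Field Types from the _default configset. Sorting needs a single value per document, so multiValued is set to false.

```xml
<!-- Numeric sort Field backed by docValues -->
<field name="price" type="pfloat" indexed="false" stored="false"
       docValues="true" multiValued="false"/>

<!-- String sort Field that relies on the index, with norms omitted -->
<field name="title_sort" type="string" indexed="true" stored="false"
       multiValued="false" omitNorms="true"/>
```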
It is advised to use docValues="true" for integer and floating point Field types and to set omitNorms="true".
In the following case Fields can be returned with highlighting.
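A sketch of a Field set up for highlighting, with a hypothetical Field name and the assumption that text_general defines a tokenizer. The Field is stored so that the original text is available to build the highlighted snippet.

```xml
<!-- Indexed and stored, with optional term vector data for highlighting -->
<field name="content" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```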
Here a tokenizer must be used for the Field. Also, termVectors is not required, but must be set to true for termPositions to be used.
In the following case a Field can be used for Field faceting.
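A minimal sketch of a facet Field, assuming a hypothetical brand Field and the string Field Type. A non-tokenized type is used so that each facet bucket is a whole value rather than a single word.

```xml
<!-- Facet on whole values; docValues speeds up the grouping -->
<field name="brand" type="string" indexed="true" stored="false" docValues="true"/>
```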
It is advised to use docValues="true" for faceting but not required.