Beginner
If you have been with us since the start, here all of our hard work comes together. We go from start to finish, capturing the key components of previous tutorials. As with our previous dataset called films, we will continue with a simple structure, meaning a single core in Standalone mode. And because it is easier to build a test environment on a local installation instead of a production server, we will leave topics like SolrCloud mode for later. What is different is that instead of using structured data, we will incorporate unstructured data from a website crawl.
If you are new to our Solr Tutorial series, some of the terms I use may not make sense, so if this project interests you, I suggest heading to the start. We went through the details somewhat painstakingly earlier and won't spend much time doing so here.
So why is website search so important in 2017? Well, in case you haven't noticed, it is becoming more and more of a search world. The quality of results from the search box on many websites is a key differentiator. Long gone are the days when web surfers would navigate menus and directory structures; there's just too much competition out there. Could you imagine asking users to navigate menus on Amazon or eBay, social posts on Instagram or jobs on LinkedIn? It just doesn't happen like that, which is why search is so important.
Of course many use Google to find everything. For custom website search, developers do have a few options, but many took the easy route with Google Site Search because the quality of results matched what people are accustomed to at Google.com. In 2017, this option is no longer available, so many are scrambling for either a managed search offering or tools to build a custom search application with Elasticsearch or Apache Solr.
So with that as a backdrop, let's have some fun with a web crawl in Apache Solr Search.
Videos can also be accessed from our Apache Solr Search Playlist on YouTube.
Solr Web Crawl - Crawl Websites and Search in Apache Solr (17:12)
Moving on to Step 1, we will focus on a core in Standalone mode as opposed to a collection in the distributed SolrCloud mode. Also, we will use what is called a "Schemaless" configuration which allows the Solr Lucene combination of programs to interpret and modify the schema as we send HTML documents to be indexed.
We are sitting in the installation directory, which in my case is solr-7.0.0, and we will access the command line tools from here. We can see two directories relevant to us: bin and server.
The bin directory includes two scripts we will need. First is the bin/solr script, which has 12 commands used to manage the server instance and build cores. Second is the bin/post tool, used to post documents like web pages and create the index.
Once built, the core and all of its data will sit in the server directory, inside the solr sub-directory.
We can use the bin/solr status command to see that we have a server node up and running. For newcomers, you can go back a few tutorials to see how we did this, or use the bin/solr start command to start an instance with all of the default settings.
In our next step, when we build a core, it will sit in solr_home, identified in this output, which is the home to all Solr cores on this instance.
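As a quick recap, those two commands look like this when run from the installation directory; the port of 8983 is just the default and my assumption if you changed nothing.

    bin/solr start     # start a node with all default settings (port 8983)
    bin/solr status    # confirm the node is running and note the solr_home path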
Now, to build a core with the simplest of settings, we just need to point to the bin/solr create_core command and give it a unique name after the -c flag. We will call it solrhelp.
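Run from the installation directory, the command should look something like this.

    bin/solr create_core -c solrhelp   # create a core named solrhelp in Standalone mode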
You can review the output later, but to summarize what happened, Solr copied default configuration files from a directory called _default to this new core called solrhelp, in a subdirectory called conf. This output also offers a gentle reminder that configuration files need to be customized before heading to a production environment. We will revisit this topic shortly.
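If you are curious where those files landed, a quick listing will show them; the path below assumes the default solr_home under the server directory.

    ls server/solr/solrhelp/conf   # configuration copied from _default, including managed-schema and solrconfig.xml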
For Step 2, we will post HTML documents from the Internet using the bin/post tool from the command line. We could also use this tool to access HTML documents locally, which might be a more realistic scenario in a production environment.
Help, usage and examples can be found for the bin/post tool by adding the -help option.
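In other words, from the installation directory:

    bin/post -help   # print usage, Solr options and web crawl options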
A few points to note here before we move on. First, in the Usage: section is a note that the -c option is required to identify which core (collection) to post the data to.
Second, in the Solr options section you could override any of the defaults with respect to where to find the core, meaning where to pass the documents.
Third, under Web crawl options: are two useful settings. The -recursive option followed by an integer instructs Solr to dig down the specified number of directories to find documents, with 1 being the default. Also, an integer after the -delay option is a courteous way to treat the server on the other end, by waiting a number of seconds between http requests.
Fourth, while we are here, take a look at the long and impressive list of file types that can be indexed besides HTML.
Fifth, provided at the bottom is an example of usage for a web crawl.
The post tool offers a very basic way to perform web crawls. In production, developers typically use a more robust or custom-written web crawling tool, but the bin/post tool may work for a simple solution and is perfectly acceptable for a crawl of local documents on filesystems.
Also, be mindful of the website publisher's copyrighted material. Because I want to avoid problems with data rights and licensing issues, for this illustration I will point to FactorPad's own HTML web pages for Solr content.
I won't do this now, but the following line would traverse the named directory and one below, with a delay of 10 seconds between each http request, pick up the HTML documents and index them in the solrhelp core.
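Reconstructing that line from the options described here, it would look something like the following; treat the exact URL and flags as my assumptions and double-check them against the -help output before running anything.

    bin/post -c solrhelp https://factorpad.com/tech/solr/ -recursive 1 -delay 10 -filetypes html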
The -filetypes html option tells Solr to anticipate html content, and the / on the end tells Solr that you are pointing to a directory instead of a single file. Also, if you are curious, when scraping web pages this tool will access the website's robots.txt file first to verify that it isn't off limits to web search engines and crawlers.
The directory tech/solr on factorpad.com holds both tutorial and reference materials on Solr in HTML format. Currently there is 1 document there and another 20 or so in the subdirectories. For this exercise I will just pick up one HTML file to simplify things for Step 3, when we review search results.
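To keep the crawl to that one page, something along these lines should work; the -recursive 0 setting is my assumption for switching off link-following, so verify it against the help output first.

    bin/post -c solrhelp https://factorpad.com/tech/solr/index.html -recursive 0 -filetypes html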
As you will see, Solr, using the default "Schemaless" configuration, will not perform the search-engine-quality full-text indexing you may expect without further modifications to the schema. That said, let's grab that HTML document so I can show you why.
With the HTML indexed, let's see what happens when we play with search queries.
In Step 3, we will use the Solr Admin UI to query this index. We explored each of the parameters in the last tutorial and performed searches from the command line as well, so head back there for an introduction if you have questions.
So many of us would assume that once we posted an HTML file to the index, the search functionality would work like a Google search. We know that by default the search will return the first 10 documents, just like Google, so let's give this a try.
The source document https://factorpad.com/tech/solr/index.html is a short file, so take a moment to look at it in conjunction with our search. We see terms like enterprise search and website search. We will do two queries here and see the output. The first will use a term I know does not exist on this page, "apple", and the second will use the term "website", which we know does exist.
So first, let's type apple in the q parameter and hit the Execute Query button.
As you can see from the JSON-formatted response, Solr returns a responseHeader repeating the parameters and a response with 0 records, indicating that apple was not found in the index. This is as expected.
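For those who prefer the command line, the same query can be sent to the select handler with curl; the host and port assume a default local installation.

    curl "http://localhost:8983/solr/solrhelp/select?q=apple"   # expect numFound of 0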
Now, let's try the term we know exists in the document, and type website in the q parameter and hit the Execute Query button again.
Here we have, I think, unexpected results. Let's walk through a few observations. First, there are about 30 fields here. As expected, the first field is the url, or the link to the document. The next field, titled id, we have seen before; it is the unique record identifier in the Solr index.
The rest of the output shows fields that Lucene created on its own. The field x_parsed_by refers to a tool that Lucene uses to parse the HTML tags. I'm not sure we want that in our index. Then we have a few logical meta fields that came from the <head> section, which exists in almost every HTML document. These may be nice to have, but some surely could be removed. There are keywords, author, content encoding, description and the title of the page. After that are a series of what are called Copy Fields, and I will show you those in a minute.
So the takeaway for now is that while we may have expected to see a list of terms or phrases in an inverted index, what the Solr and Lucene indexers did was find as many fields as they could recognize. This is because we are using the "Schemaless" configuration, which creates fields and modifies the schema on the fly. In a production environment, and after some experience, you will want to modify the schema so it captures the data you want; otherwise the index will get too large, which will impact speed.
The next logical move, in Step 4, is to talk about the schema, the configuration file that tells Solr how to index documents and create the fields and field types you want.
In our first data set we used structured data like that found in a database, so when sending documents to the core it was much easier to capture fields. Now with unstructured, or full-text data, this is a bit more of a challenge.
As mentioned in previous tutorials, Solr offers two schema files that do the same thing. One is called managed-schema and the other is called schema.xml. Both are in XML format.
Here, like most beginners, we are using a "Schemaless" configuration because it is a good way to test things out in a local environment. However, to create a website search box that is useful for visitors, the developer must get comfortable with editing the schema.
And "Schemaless" doesn't mean there is no schema, it simply means that
the file called managed-schema
adapts
as new documents are submitted. This option can be switched
off of course, but it is made available for beginners until they gain
a comfort level with the settings. I should mention that
managed-schema
should only be modified
with the Solr Admin UI or the Solr Schema API, to prevent us from
making mistakes.
The second type of schema file, called schema.xml, can be hand-edited by experienced developers.
Before I performed Step 2 and posted documents to the core, out of curiosity, I checked the size of the managed-schema file; at default it was about 50,000 bytes. Looking at the location and size after the post, the file is now about half the size of the original.
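Two quick shell commands are enough to check the size and length yourself; the path assumes the default core layout under server/solr.

    ls -l server/solr/solrhelp/conf/managed-schema   # size in bytes
    wc -l server/solr/solrhelp/conf/managed-schema   # number of lines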
The managed-schema file is quite long, about 500 lines, and it is easy to get lost, so instead of opening the actual file, Solr offers a way to review the automatically created fields and field types using the Solr Schema API with the curl command.
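A sketch of those Schema API calls, again assuming a default local installation, looks like this.

    curl http://localhost:8983/solr/solrhelp/schema/fields       # fields created by field guessing
    curl http://localhost:8983/solr/solrhelp/schema/fieldtypes   # field types available to the core
    curl http://localhost:8983/solr/solrhelp/schema/copyfields   # copy field rules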
Okay, so this mirrors what we saw in the output of the query. Now let's look at the list of Copy Fields I mentioned.
I bring this up to illustrate the point that there is a lot going on with the "Schemaless" configuration and its "field guessing" operation. As you can tell we have a lot of fields here, and many of these may not be necessary, wasting space and slowing down our search. So this is more proof that the "Schemaless" configuration is not meant for production.
So where do we go from here? Well, we need to start understanding what is inside that schema, so I'll introduce the topic of field analysis next and then devote the whole next tutorial to it.
So in Step 5, a few quick words about the general topic of document analysis, which is typically discussed in the context of Lucene, the engine behind Solr's interface.
Think of the indexing process, or ingestion of documents, as a sequential process, often called analysis. In the next tutorial we will look at a snippet from the schema, but for now this is basically how the process works.
First off, the schema identifies Fields and fieldTypes. Each Field is assigned a fieldType, and processing rules apply to each fieldType. Some rules are simple: once you identify a boolean true or false, processing is easy, as it is one or the other. The interpretation of text fields is more difficult. Remember, we are using a computer to interpret and categorize human language, and that is what the analysis process is all about.
The best way to describe this at a high level is with the Solr Admin UI, under the Analysis tab. In the dropdown, select text_en, which is one of many fieldTypes configured in our managed-schema file. This one relates to the English language and has six layers of analysis.
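If you would rather see the definition itself than the Admin UI view, the Schema API can return just that one entry; again the URL assumes our local solrhelp core.

    curl http://localhost:8983/solr/solrhelp/schema/fieldtypes/text_en   # the full text_en fieldType definition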
Now, in the Field Value (Index) box, type Apple's success is because Apples' coders ate apples. Uncheck the Verbose Output box and then hit the Analyze Values button. The table provides a list of the sequential steps in the analysis process.
Step | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
Original | Apple's | success | is | because | Apples' | coders | ate | apples. |
ST | Apple's | success | is | because | Apples | coders | ate | apples |
SF | Apple's | success | because | Apples | coders | ate | apples | |
LCF | apple's | success | because | apples | coders | ate | apples | |
EPF | apple | success | because | apples | coders | ate | apples | |
SKMF | apple | success | because | apples | coders | ate | apples | |
PSF | appl | success | becaus | appl | coder | at | appl | |
So each step of the way there is a class of code written to process text and hand it off to the next one until it is finished. The line at the bottom is what goes into the index.
I suggest spending a bit of time thinking about the logic of each step. Don't sweat the details if it isn't making sense; I will pick up right where we are leaving off here and devote the whole next tutorial to the analysis process, including field analyzers, tokenizers and filters.
As you can see, we are starting to enter the world of analysis, which is where you can start thinking about your business needs for your search application. I hope that gets you excited because I know many of you are starting to plan your website search tool as a replacement for Google Site Search.
With that, you have pretty much seen a start-to-finish case of indexing with a web crawl by creating a core, posting HTML, searching, reviewing the schema and finally touching on the language analysis process of text fields with Lucene and Solr.
Yes, there are many aspects to creating a useful search tool and I'm here to help if you need a customized solution. So please feel free to reach out to me.
Q: What is the Field Value (Query) box in the Analysis tab of the Solr Admin UI for?
A: Analyzers are used during both the indexing step and the searching step. You may want text processed one way for the queries your website visitors type into the search box, and with more advanced customizations at indexing time.