
Crawl Websites with Apache Solr

Here we put it all together: create a core, crawl websites, post documents, search and analyze output.
  1. A new core - Create a core called solrhelp.
  2. Post HTML - Use the post tool to index HTML using a web crawl.
  3. Search - Do a search query in the Solr Admin UI and evaluate results.
  4. Review schema - Review fields and field types created by a "Schemaless" configuration.
  5. Indexing - Introduce Lucene language analysis.
By Paul Alan Davis, CFA
Updated: February 25, 2021
So that's search from start to finish. Now let's walk through each step.






Build a Web Crawl Search Tool in Apache Solr

Beginner

If you have been with us since the start, here all of our hard work comes together. We go from start to finish, capturing the key components of previous tutorials. As with our previous dataset called films, we will continue with a simple structure, meaning a single core in Standalone mode. And because it is easier to build a test environment on a local installation than on a production server, we will leave topics like SolrCloud mode for later. What is different is that instead of using structured data we will incorporate unstructured data from a website crawl.

If you are new to our Solr Tutorial series, some of the terms I use may not make sense, so if this project interests you I suggest heading to the start. We went through the details somewhat painstakingly earlier and won't spend much time repeating them here.

What is the opportunity?

So why is website search so important in 2017? Well, if you haven't noticed, it is becoming more and more of a search world. The quality of results from the search box on many websites is a key differentiator. Long gone are the days when web surfers would navigate menus and directory structures; there's just too much competition out there. Could you imagine asking users to navigate menus on Amazon or eBay, social posts on Instagram or jobs on LinkedIn? It just doesn't happen like that, which is why search is so important.

Of course many use Google to find everything. For custom website search, developers do have a few options, but many took the easy route with Google Site Search because the quality of results matched what people are accustomed to at Google.com. In 2017, this option is no longer available, so many are scrambling for either a managed search offering or tools to build a custom search application with Elasticsearch or Apache Solr.

So with that as a backdrop, let's have some fun with a web crawl in Apache Solr Search.

Apache Solr in Video

Videos can also be accessed from our Apache Solr Search Playlist on YouTube (opens in a new browser window).

Solr Web Crawl - Crawl Websites and Search in Apache Solr (17:12)


Step 1 - Create a Core in Standalone Mode with a Schemaless Configuration

Moving on to Step 1, we will focus on a core in Standalone mode as opposed to a collection in the distributed SolrCloud mode. Also, we will use what is called a "Schemaless" configuration, which allows the Solr and Lucene combination of programs to interpret and modify the schema as we send HTML documents to be indexed.

Get oriented from the installation directory

We are sitting in the installation directory, which in my case is solr-7.0.0, and we will access the command-line tools from here.

$ pwd; ls -og
/home/paul/solr-7.0.0
total 1464
drwxr-xr-x  3   4096 Oct 11 09:16 bin
-rw-r--r--  1 722808 Sep  8 12:36 CHANGES.txt
drwxr-xr-x 11   4096 Sep  8 13:21 contrib
drwxr-xr-x  4   4096 Oct  1 11:22 dist
drwxr-xr-x  3   4096 Oct  2 19:21 docs
drwxr-xr-x  7   4096 Oct  3 22:48 example
drwxr-xr-x  2  32768 Oct  1 11:22 licenses
-rw-r--r--  1  12646 Sep  8 12:34 LICENSE.txt
-rw-r--r--  1 655812 Sep  8 12:36 LUCENE_CHANGES.txt
-rw-r--r--  1  24831 Sep  8 12:34 NOTICE.txt
-rw-r--r--  1   7271 Sep  8 12:34 README.txt
drwxr-xr-x 11   4096 Oct  1 11:55 server

We can see two directories relevant to us, bin and server.

$ ls -og bin server
bin:
total 200
drwxr-xr-x 2  4096 Sep  8 12:34 init.d
-rwxr-xr-x 1 12694 Sep  8 12:34 install_solr_service.sh
-rwxr-xr-x 1  1255 Sep  8 12:34 oom_solr.sh
-rwxr-xr-x 1  8209 Sep  8 12:34 post
-rwxr-xr-x 1 74749 Sep  8 12:36 solr
-rw-r--r-- 1     5 Oct 11 09:16 solr-8983.pid
-rwxr-xr-x 1 68007 Sep  8 12:36 solr.cmd
-rwxr-xr-x 1  6831 Sep  8 12:34 solr.in.cmd
-rwxr-xr-x 1  7314 Sep  8 12:34 solr.in.sh

server:
total 180
drwxr-xr-x 2   4096 Oct  1 11:22 contexts
drwxr-xr-x 2   4096 Oct  1 11:22 etc
drwxr-xr-x 3   4096 Oct  1 11:22 lib
drwxr-xr-x 3   4096 Oct 11 09:16 logs
drwxr-xr-x 2   4096 Oct  1 11:22 modules
-rw-r--r-- 1   3977 Sep  8 12:34 README.txt
drwxr-xr-x 2   4096 Oct  1 11:22 resources
drwxr-xr-x 3   4096 Sep  8 12:34 scripts
drwxr-xr-x 4   4096 Oct 20 10:41 solr
drwxr-xr-x 3   4096 Sep  8 13:21 solr-webapp
-rw-r--r-- 1 142488 Oct 28  2016 start.jar

The bin directory includes two scripts we will need. First is the bin/solr script, which has 12 commands used to manage the server instance and build cores. Second is the bin/post tool, used to post documents like web pages and create the index.
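
If you would like to see that list of commands for yourself, the bin/solr script prints its own usage summary; this is standard behavior of the script rather than anything specific to this tutorial.

$ bin/solr -help     # prints usage along with the list of available commands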

Once built, the core and all of its data will sit in the server directory and inside the solr sub-directory.

Check the status of the Solr instance

We can use the bin/solr status command to see that we have a server node up and running.

$ bin/solr status

Found 1 Solr nodes:

Solr process 5700 running on port 8983
{
  "solr_home":"/home/paul/solr-7.0.0/server/solr",
  "version":"7.0.0 3ba304b2826a92349c51457d9f8 - anshum - 2017-09-08 13:21:08",
  "startTime":"2017-10-11T16:16:15.224Z",
  "uptime":"16 days, 5 hours, 4 minutes, 37 seconds",
  "memory":"46.5 MB (%9.5) of 490.7 MB"}

For newcomers, you can go back a few tutorials to see how we did this, or use the bin/solr start command to start an instance with all of the default settings.
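
For reference, with all defaults that is a one-line command, and stopping the node later is just as simple.

$ bin/solr start         # start a node on the default port 8983
$ bin/solr stop -all     # stop any running nodes when you are finished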

In our next step, when we build a core, it will sit in the solr_home identified in this output, which is home to all Solr cores on this instance.
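
As a quick check, you can list solr_home directly; with the default layout reported above it is simply the server/solr sub-directory of the installation.

$ ls server/solr     # solr_home: each core we create gets its own sub-directory here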

Create a Core Using the bin/solr create_core command

Now, to build a core with the simplest of settings, we just need to run the bin/solr create_core command and give it a unique name after the -c flag. We will call it solrhelp.

$ bin/solr create_core -c solrhelp
WARNING: Using _default configset. Data driven schema functionality is enabled by default, which is
NOT RECOMMENDED for production use.

To turn it off:
   curl http://localhost:8983/solr/solrhelp/config -d '{"set-user-property": {"update.autoCreateFields":"false"}}'

Copying configuration to new core instance directory:
/home/paul/solr-7.0.0/server/solr/solrhelp

Creating new core 'solrhelp' using command:
http://localhost:8983/solr/admin/cores?action=CREATE&name=solrhelp&instanceDir=solrhelp

{
  "responseHeader":{
    "status":0,
    "QTime":252},
  "core":"solrhelp"}

You can review the output later, but to summarize what happened, Solr copied default configuration files from a directory called _default to this new core called solrhelp in a subdirectory called conf. This output also offers a gentle reminder that configuration files need to be customized before heading to a production environment. We will revisit this topic shortly.
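
If you are curious about what was copied, listing the new core's conf sub-directory shows the configuration files Solr pulled in from _default; the path comes straight from the output above.

$ ls server/solr/solrhelp/conf     # configuration files copied from the _default configset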

Step 2 - Use the Post Tool to Index HTML from a Web Crawl

For Step 2, we will post HTML documents from the Internet using the bin/post tool from the command line. We could also use this tool to access HTML documents locally, which might be a more realistic scenario in a production environment.

Access documentation for the bin/post tool

Help, usage and examples can be found for the bin/post tool by adding the -help option.

$ bin/post -help

Usage: post -c <collection> [OPTIONS] <files|directories|urls|-d ["...",...]>
    or post -help

   collection name defaults to DEFAULT_SOLR_COLLECTION if not specified

OPTIONS
=======
  Solr options:
    -url <base Solr update URL> (overrides collection, host, and port)
    -host <host> (default: localhost)
    -p or -port <port> (default: 8983)
    -commit yes|no (default: yes)
    -u or -user <user:pass> (sets BasicAuth credentials)

  Web crawl options:
    -recursive <depth> (default: 1)
    -delay <seconds> (default: 10)

  Directory crawl options:
    -delay <seconds> (default: 0)

  stdin/args options:
    -type <content/type> (default: application/xml)

  Other options:
    -filetypes <type>[,<type>,...] (default: xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log)
    -params "<key>=<value>[&<key>=<value>...]" (values must be URL-encoded; these pass through to Solr update request)
    -out yes|no (default: no; yes outputs Solr response to console)
    -format solr (sends application/json content as Solr commands to /update instead of /update/json/docs)

Examples:

* JSON file: /home/paul/solr-7.0.0/bin/post -c wizbang events.json
* XML files: /home/paul/solr-7.0.0/bin/post -c records article*.xml
* CSV file: /home/paul/solr-7.0.0/bin/post -c signals LATEST-signals.csv
* Directory of files: /home/paul/solr-7.0.0/bin/post -c myfiles ~/Documents
* Web crawl: /home/paul/solr-7.0.0/bin/post -c gettingstarted http://lucene.apache.org/solr -recursive 1 -delay 1
* Standard input (stdin): echo '{commit: {}}' | /home/paul/solr-7.0.0/bin/post -c my_collection -type application/json -out yes -d
* Data as string: /home/paul/solr-7.0.0/bin/post -c signals -type text/csv -out yes -d $'id,value\n1,0.47'

A few points to note here before we move on. First, in the Usage: section is a note that the -c option will be required to identify which core (collection) to post the data to.

Second, in the Solr options section you can override any of the defaults with respect to where to find the core, meaning where to send the documents (see the short sketch after these notes).

Third, under Web crawl options: are two useful settings. The -recursive option followed by an integer instructs Solr to follow links the specified number of levels below the starting page to find documents, with 1 being the default. Also, an integer after the -delay option is a courteous way to treat the server on the other end, by waiting that number of seconds between HTTP requests.

Fourth, while we are here, take a look at the long and impressive list of file types that can be indexed besides HTML.

Fifth, provided at the bottom is an example of usage for a web crawl.
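
To illustrate the second point, here is a sketch, not something we will run here, that posts the same page to a Solr instance on another host; treat the host and port as placeholders for your own setup.

$ # Sketch only: send the same post to a Solr instance running on another host and port
$ bin/post -c solrhelp -host 192.168.0.8 -p 8983 https://factorpad.com/tech/solr/index.html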

How would you post HTML files from a web crawl?

The post tool offers a very basic way to perform web crawls. In production, developers typically use a more robust or custom-written web crawler, but the bin/post tool may work for a simple solution and is perfectly acceptable for a crawl of local documents on a filesystem.
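
For example, a local crawl might look like the sketch below; the directory path is hypothetical, so substitute wherever your HTML files actually live.

$ # Hypothetical local path: index the HTML files found under this directory
$ bin/post -c solrhelp -filetypes html ~/my-site-backup/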

Also, be mindful of the website publisher's copyrighted material. Because I want to avoid problems with data rights and licensing, for this illustration I will point to FactorPad's own HTML pages for Solr content.

I won't run this now, but the following line would traverse the named directory and one level below it, waiting 10 seconds between each HTTP request, pick up the HTML documents and index them in the solrhelp core.

$ bin/post -c solrhelp -filetypes html https://factorpad.com/tech/solr/ -recursive 1 -delay 10

The -filetypes html option tells Solr to anticipate HTML content, and the / on the end tells Solr that you are pointing to a directory instead of a single file. Also, if you are curious, when scraping web pages this tool accesses the website's robots.txt file first to verify that the content isn't off limits to search engines and crawlers.

Post a single HTML file to the solrhelp core

The directory tech/solr on factorpad.com holds both tutorial and reference materials on Solr in HTML format. Currently there is 1 document there and another 20 or so in the subdirectories. For this exercise I will just pick up one HTML file to simplify things for Step 3, when we review search results.

As you will see, Solr, using the default "Schemaless" configuration, will not perform the search-engine-quality full-text indexing you might expect without further modifications to the schema. That said, let's grab that HTML document so I can show you why.

$ bin/post -c solrhelp -filetypes html https://factorpad.com/tech/solr/index.html

With the HTML indexed, let's see what happens when we play with search queries.
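
Before heading to the Admin UI, a quick sanity check from the command line confirms the post landed. A match-all query with rows=0 just returns the document count, which should be 1 at this point.

$ curl "http://localhost:8983/solr/solrhelp/select?q=*:*&rows=0"     # numFound shows how many documents are indexed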

Step 3 - Query solrhelp and See What Fields Were Created

In Step 3, we will use the Solr Admin UI to query this index. We explored each of the parameters in the last tutorial and performed searches from the command line as well, so head back there for an introduction if you have questions.

Examine the input HTML file in conjunction with a search

So many of us would assume that once we posted an HTML file to the index, the search functionality would work like a Google search. We know that by default the search will return the first 10 documents, just like Google, so let's give this a try.

The source document https://factorpad.com/tech/solr/index.html is a short file, so take a moment to look at it in conjunction with our search. We see terms like enterprise search and website search. We will do two queries here and see the output. The first will use a term I know does not exist on this page, "apple", and the second will use the term "website", which we know does exist.

Enter a non-existent term and see what happens

So first, let's type apple in the q parameter and hit the Execute Query button.

{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"apple",
      "_":"1509386786455"}},
  "response":{"numFound":0,"start":0,"docs":[]
  }}

As you can see from the JSON-formatted response, Solr returns a responseHeader repeating the parameters and a response with 0 records, indicating that apple was not found in the index. This is as expected.

Enter an existing term and see what happens

Now, let's try the term we know exists in the document, and type website in the q parameter and hit the Execute Query button again.

{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"website",
      "_":"1509386786455"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "url":["https://factorpad.com/tech/solr/index.html"],
        "id":"https://factorpad.com/tech/solr/index.html",
        "stream_size":["null"],
        "x_parsed_by":["org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.html.HtmlParser"],
        "stream_content_type":["text/html"],
        "keywords":["apache solr, solr, solr search, solr tutorial, solr reference, website search, enterprise search, FactorPad tutorials"],
        "viewport":["width=device-width, initial-scale=1.0"],
        "dc_title":["Learn enterprise search and website search with Apache Solr"],
        "author":["FactorPad LLC"],
        "content_encoding":["UTF-8"],
        "content_type_hint":["text/html; charset=UTF-8"],
        "description":["Learn web development and enterprise search in Apache Solr with free tutorials and a reference at FactorPad."],
        "title":["Learn enterprise search and website search with Apache Solr"],
        "content_type":["text/html; charset=UTF-8"],
        "stream_size_str":["null"],
        "url_str":["https://factorpad.com/tech/solr/index.html"],
        "dc_title_str":["Learn enterprise search and website search with Apache Solr"],
        "x_parsed_by_str":["org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.html.HtmlParser"],
        "description_str":["Learn web development and enterprise search in Apache Solr with free tutorials and a reference at FactorPad."],
        "content_type_str":["text/html; charset=UTF-8"],
        "content_type_hint_str":["text/html; charset=UTF-8"],
        "stream_content_type_str":["text/html"],
        "viewport_str":["width=device-width, initial-scale=1.0"],
        "title_str":["Learn enterprise search and website search with Apache Solr"],
        "keywords_str":["apache solr, solr, solr search, solr tutorial, solr reference, website search, enterprise search, FactorPad tutorials"],
        "author_str":["FactorPad LLC"],
        "_version_":1582472164235280384,
        "content_encoding_str":["UTF-8"]}]
  }}

Here we have, I think, unexpected results. Let's walk through a few observations. First, there are about 30 fields here. As expected, the first field is the url, or the link to the document. The next field, id, which we have seen before, is the unique identifier for the record in the Solr index.

The rest of the output shows fields that Lucene created on its own. The field x_parsed_by refers to the parser (Apache Tika) used to extract content from the HTML tags. I'm not sure we want that in our index. Then we have a few logical meta fields that came from the <head> section, which exists in almost every HTML document. These may be nice to have, but some surely could be removed. There are keywords, author, content encoding, description and the title of the page. After that are a series of what are called Copy Fields, and I will show you those in a minute.

So the takeaway for now is that while we may have expected to see a list of terms or phrases in an inverted index, what the Solr and Lucene indexers did was find as many fields as they could recognize. This is because we are using the "Schemaless" configuration that creates fields and modifies the schema on the fly. In a production environment, and after some experience, you will want to modify the schema so it captures the data you want; otherwise the index will grow too large, which will hurt speed.

Step 4 - Review Resulting Fields and Field Types in the Schema

The next logical move in Step 4 is to talk about schema, or the configuration file that tells Solr how to index documents and create fields and field types you want.

In our first data set we used structured data like that found in a database, so when sending documents to the core it was much easier to capture fields. Now with unstructured, or full-text data, this is a bit more of a challenge.

The "Schemaless" configuration, schema.xml and managed-schema

As mentioned in previous tutorials, Solr offers two schema files that serve the same purpose. One is called managed-schema and the other is called schema.xml. Both are in XML format.

Here, like most beginners, we are using a "Schemaless" configuration because it is a good way to test things out in a local environment. However, to create a website search box that is useful for visitors, the developer must get comfortable with editing the schema.

And "Schemaless" doesn't mean there is no schema; it simply means that the file called managed-schema adapts as new documents are submitted. This behavior can of course be switched off, but it is made available so beginners can get productive before they are comfortable with the settings. I should mention that managed-schema should only be modified through the Solr Admin UI or the Solr Schema API, to prevent us from making mistakes by hand-editing it.
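
For the record, the Schema API accepts JSON commands such as add-field, replace-field and delete-field over HTTP. The sketch below is only an illustration of the syntax, not a change we are making in this tutorial; the field and type were chosen arbitrarily.

$ # Sketch only: ask Solr to treat the description field as English text via the Schema API
$ curl -X POST -H 'Content-type:application/json' \
    -d '{ "replace-field": { "name":"description", "type":"text_en", "stored":true } }' \
    http://localhost:8983/solr/solrhelp/schema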

The second type of schema file, schema.xml, can be hand-edited by experienced developers.

What are the fields and field types in the schema after the post?

Before I performed Step 2 and posted documents to the core, out of curiosity I checked the size of the managed-schema file; at its default it was about 50,000 bytes. Looking at the location and size after the post, the file is now about half the size of the original.

$ ls -og server/solr/solrhelp/conf/managed-schema
-rw-r--r-- 1 29773 Oct 27 19:16 server/solr/solrhelp/conf/managed-schema

The managed-schema file is quite long, about 500 lines, and it is easy to get lost in it, so instead of opening the actual file we can use the Solr Schema API with the curl command to review the fields and field types that were created automatically.

$ curl http://192.168.0.8:8983/solr/solrhelp/schema/fields
{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "fields":[{
      "name":"_root_",
      "type":"string",
      "docValues":false,
      "indexed":true,
      "stored":false},
    {
      "name":"_text_",
      "type":"text_general",
      "multiValued":true,
      "indexed":true,
      "stored":false},
    {
      "name":"_version_",
      "type":"plong",
      "indexed":false,
      "stored":false},
    {
      "name":"author",
      "type":"text_general"},
    {
      "name":"content_encoding",
      "type":"text_general"},
    {
      "name":"content_type",
      "type":"text_general"},
    {
      "name":"content_type_hint",
      "type":"text_general"},
    {
      "name":"dc_title",
      "type":"text_general"},
    {
      "name":"description",
      "type":"text_general"},
    {
      "name":"id",
      "type":"string",
      "multiValued":false,
      "indexed":true,
      "required":true,
      "stored":true},
    {
      "name":"keywords",
      "type":"text_general"},
    {
      "name":"stream_content_type",
      "type":"text_general"},
    {
      "name":"stream_size",
      "type":"text_general"},
    {
      "name":"title",
      "type":"text_general"},
    {
      "name":"url",
      "type":"text_general"},
    {
      "name":"viewport",
      "type":"text_general"},
    {
      "name":"x_parsed_by",
      "type":"text_general"}]}

Okay, so this mirrors what we saw in the output of the query. Now let's look at the list of Copy Fields I mentioned.

$ curl http://192.168.0.8:8983/solr/solrhelp/schema/copyfields
{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "copyFields":[{
      "source":"stream_size",
      "dest":"stream_size_str",
      "maxChars":256},
    {
      "source":"stream_content_type",
      "dest":"stream_content_type_str",
      "maxChars":256},
    {
      "source":"keywords",
      "dest":"keywords_str",
      "maxChars":256},
    {
      "source":"author",
      "dest":"author_str",
      "maxChars":256},
    {
      "source":"x_parsed_by",
      "dest":"x_parsed_by_str",
      "maxChars":256},
    {
      "source":"content_encoding",
      "dest":"content_encoding_str",
      "maxChars":256},
    {
      "source":"content_type_hint",
      "dest":"content_type_hint_str",
      "maxChars":256},
    {
      "source":"description",
      "dest":"description_str",
      "maxChars":256},
    {
      "source":"title",
      "dest":"title_str",
      "maxChars":256},
    {
      "source":"url",
      "dest":"url_str",
      "maxChars":256},
    {
      "source":"content_type",
      "dest":"content_type_str",
      "maxChars":256},
    {
      "source":"viewport",
      "dest":"viewport_str",
      "maxChars":256},
    {
      "source":"dc_title",
      "dest":"dc_title_str",
      "maxChars":256}]}

I bring this up to illustrate the point that there is a lot going on with the "Schemaless" configuration and its "field guessing" operation. As you can tell, we have a lot of fields here, and many of them may not be necessary, wasting space and slowing down our searches. So this is more proof that the "Schemaless" configuration is not meant for production.
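
If you did want to trim the list, the same Schema API has a delete-copy-field command. A sketch, using one of the rules shown above, would look like this; note that it only changes the rule going forward, so documents already indexed keep their copied values until they are reindexed.

$ # Sketch only: remove one of the automatically created copy field rules
$ curl -X POST -H 'Content-type:application/json' \
    -d '{ "delete-copy-field": { "source":"viewport", "dest":"viewport_str" } }' \
    http://192.168.0.8:8983/solr/solrhelp/schema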

So where do we go from here? Well, we need to start understanding what is inside that schema, so I'll introduce the topic of field analysis next and then devote the whole next tutorial to it.

Step 5 - Introduce Lucene analyzers, tokenizers and filters

So in Step 5, a few quick words about the general topic of document analysis, which is typically discussed in the context of Lucene, the engine behind Solr's interface.

Think of the indexing process, or ingestion of documents, as a sequential pipeline, often called analysis. In the next tutorial we will look at a snippet from the schema, but for now here is basically how the process works.

First off, the schema identifies fields and field types. Each field is assigned a field type, and processing rules apply to each field type. Some rules are simple: once you identify a boolean, processing is easy, because it is either true or false. The interpretation of text fields is more difficult. Remember, we are asking a computer to interpret and categorize human language, and that is what the analysis process is all about.

The best way to describe this at a high level is with the Solr Admin UI, under the Analysis tab. In the dropdown, select text_en, one of many field types configured in our managed-schema file. This one relates to the English language and has six layers of analysis.

Now in the Field Value (Index) box type Apple's success is because Apples' coders ate apples. Uncheck the Verbose Output and then hit the Analyze Values button. The table provides a list of the sequential steps in the analysis process.

Step      1        2        3   4        5        6       7    8
Original  Apple's  success  is  because  Apples'  coders  ate  apples.
ST        Apple's  success  is  because  Apples   coders  ate  apples
SF        Apple's  success      because  Apples   coders  ate  apples
LCF       apple's  success      because  apples   coders  ate  apples
EPF       apple    success      because  apples   coders  ate  apples
SKMF      apple    success      because  apples   coders  ate  apples
PSF       appl     success      becaus   appl     coder   at   appl

So at each step of the way there is a class of code written to process the text and hand it off to the next step until it is finished. The line at the bottom is what goes into the index.
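
The abbreviations down the left of the table are how the Admin UI shortens the names in the chain: the Standard Tokenizer followed by the Stop, Lower Case, English Possessive, Keyword Marker and Porter Stem filters. If you prefer the command line, the Schema API we used earlier can also return the full definition of this field type, analyzers included.

$ curl http://192.168.0.8:8983/solr/solrhelp/schema/fieldtypes/text_en     # shows the text_en analyzer chain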

I suggest spending a bit of time thinking about the logic of each step. Don't sweat the details if it isn't making sense; I will pick up right where we are leaving off and devote the whole next tutorial to the analysis process, including field analyzers, tokenizers and filters.

Summary

As you can see, we are starting to enter the world of analysis, which is where you can start thinking about your business needs for your search application. I hope that gets you excited because I know many of you are starting to plan your website search tool as a replacement for Google Site Search.

With that, you have pretty much seen a start-to-finish case of indexing with a web crawl by creating a core, posting HTML, searching, reviewing the schema and finally touching on the language analysis process of text fields with Lucene and Solr.

Yes, there are many aspects to creating a useful search tool and I'm here to help if you need a customized solution. So please feel free to reach out to me.




Questions and Answers

Q:  What is the Field Value (Query) box in the Analysis tab of the Solr Admin UI for?
A:  Analyzers are used both during the indexing step and the searching step. You may want text from your website search box processed differently at query time than at index time, where more advanced customizations are often applied.


What's Next?

Why not connect on YouTube, Twitter or the email list? It's free, no-strings learning.




