Running a search engine

Wouldn’t it be great to get an inside look at the workings of a search engine?

You could find out why some pages rank higher than others, and understand how and why pages are fetched from the Web, indexed and made available for search. You could also gain an appreciation of the massive amounts of data out there.

There is a way to achieve all this…

I recently spent some time installing and evaluating Nutch, which is part of the Apache Lucene project. Nutch describes itself as “… open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.” One thing that you will notice about the project is that it is very sparing in its use of words. Nutch allows you to create an initial list of pages, fetch those pages from the Web and index them. It also includes a search form.

Default Nutch search screen
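To make the seed list, fetch, index and search steps a little more concrete, here is a toy sketch in Python. It is not how Nutch is implemented: the `FAKE_WEB` dictionary, the example.org URLs and the function names are all made up for illustration, and the "Web" is an in-memory dictionary so that nothing is actually downloaded.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the Web: URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://example.org/": (
        "Welcome to the search engine demo",
        ["http://example.org/a", "http://example.org/b"],
    ),
    "http://example.org/a": (
        "Nutch builds on Lucene for search",
        ["http://example.org/"],
    ),
    "http://example.org/b": ("A crawler fetches pages from the web", []),
}


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crawl(seeds, max_pages=10):
    """Breadth-first 'fetch' starting from the seed list."""
    queue = list(seeds)
    seen = set(seeds)
    fetched = {}
    while queue and len(fetched) < max_pages:
        url = queue.pop(0)
        if url not in FAKE_WEB:
            continue  # treat unknown URLs as unreachable
        text, links = FAKE_WEB[url]
        fetched[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


def build_index(pages):
    """Build an inverted index: term -> set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)


pages = crawl(["http://example.org/"])
index = build_index(pages)
print(search(index, "lucene search"))
```

A real crawler also has to deal with politeness rules, duplicate detection, parsing of many document formats and storage that does not fit in memory, which is where a project like Nutch earns its keep.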

What is extraordinary about this project is the speed at which it fetches and indexes pages, even on an old e-Machines box on which I installed Ubuntu server.

What is also interesting is to review the results that you get when you do a search.

After you do a search (using Nutch’s default values) the results appear as follows (I added the red oval):

Default Nutch search results

If you click on the explain link, you get something like the following:

An analysis of Nutch search results

I haven’t worked out what it all means yet, and of course this is not the same algorithm used by Google, Yahoo! or MSN, but it does start to give an inkling of how search works and how results are ordered.
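As a rough illustration of the kind of arithmetic that can sit behind a per-result score breakdown, here is a small Python sketch of a generic TF-IDF style ranking. To be clear, this is an assumption for illustration and not the formula Nutch reports: the square-root term frequency, the logarithmic inverse document frequency, the sample documents and the function names are all my own choices.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def explain(query, doc_id, docs):
    """Return (total score, per-term breakdown) for one document."""
    n_docs = len(docs)
    tokens = tokenize(docs[doc_id])
    breakdown = []
    for term in tokenize(query):
        # How often the term occurs in this document.
        freq = tokens.count(term)
        # How many documents in the collection contain the term.
        doc_freq = sum(1 for text in docs.values() if term in tokenize(text))
        tf = math.sqrt(freq)  # dampened term frequency (illustrative choice)
        idf = 1.0 + math.log(n_docs / (doc_freq + 1))  # rarer terms weigh more
        breakdown.append({"term": term, "tf": tf, "idf": idf, "score": tf * idf})
    total = sum(part["score"] for part in breakdown)
    return total, breakdown


def rank(query, docs):
    """Order matching documents by descending score, ties broken by id."""
    scored = [(explain(query, doc_id, docs)[0], doc_id) for doc_id in docs]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]


# Made-up sample collection.
docs = {
    "a": "nutch nutch search",
    "b": "search engine",
    "c": "web crawler",
}

print(rank("nutch search", docs))
total, parts = explain("nutch search", "a", docs)
for part in parts:
    print(part)
```

The point of the breakdown is that each query term contributes its own piece to the total, so a page that mentions a rare term several times outranks one that mentions only a common term once.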
