A web search engine is a software system designed to search for information on the World Wide Web. The results are generally presented as a list, often referred to as search engine results pages (SERPs). The results may be a mix of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines also keep their information up to date automatically by running a web crawler.
There are three basic stages for a search engine:
- crawling – where content is discovered
- indexing – where it is analysed and stored in huge databases
- retrieval – where a user query fetches a list of relevant pages.
Crawling is where it all begins – the acquisition of data about a website. This involves scanning the site and getting a complete list of everything on there – the page title, images, keywords it contains, and any other pages it links to – at a bare minimum. Modern crawlers may cache a copy of the whole page, as well as look for some additional information such as the page layout, where the advertising units are, where the links are on the page (featured prominently in the article text, or hidden in the footer?).
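To make that "bare minimum" concrete, here is a minimal sketch of pulling the title, links and images out of a single page with Python's standard `html.parser` module. The `PageSummary` class, the `summarize` helper and the sample HTML are all made up for illustration; a real crawler collects far more than this.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collect the title, outgoing links, and image sources of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self._in_title:
            self.title += data


def summarize(html):
    """Return the basic facts a crawler might record about a page."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "links": parser.links,
        "images": parser.images,
    }


# Hypothetical page used only for this example.
SAMPLE_HTML = """
<html><head><title>Example page</title></head>
<body>
  <p>Read the <a href="/about">about page</a>.</p>
  <img src="/logo.png" alt="logo">
  <footer><a href="https://example.com/terms">Terms</a></footer>
</body></html>
"""

summary = summarize(SAMPLE_HTML)
print(summary)
```

The links list is what feeds the next round of crawling; telling a prominent link from a footer link would need extra bookkeeping that this sketch leaves out.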
How is a website crawled exactly? An automated bot – a spider – visits each page, just like you or I would, only very quickly. Even in the earliest days, Google reported that they were reading a few hundred pages a second. If you’d like to learn how to make your own basic web crawler in PHP, it was one of the first articles I wrote here and is well worth having a go at (just don’t expect to make the next Google).
The crawler then adds all the new links it found to a list of places to crawl next – in addition to re-crawling sites it already knows to see if anything has changed. It’s a never-ending process, really.
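That "list of places to crawl next" is often called a frontier. Below is a rough sketch of the idea using a queue and a set of already-seen URLs. To keep it runnable without touching the network, the "web" here is a made-up dictionary (`FAKE_WEB`) and `fetch_links` is a stand-in for downloading a page and extracting its links; re-crawling for changes is not modelled.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> pages it links to.
FAKE_WEB = {
    "a.example/": ["a.example/news", "b.example/"],
    "a.example/news": ["a.example/", "c.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}


def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl from `seed`; return pages in the order visited."""
    frontier = deque([seed])   # places to crawl next
    seen = {seed}              # everything ever queued, to avoid loops
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl("a.example/", lambda url: FAKE_WEB.get(url, []))
print(order)
```

The `max_pages` cap is an assumption of this sketch; it is one simple way to stop a crawl that would otherwise never end.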
Any site that is linked to from another site already indexed, or any site that has manually asked to be indexed, will eventually be crawled – some sites more frequently than others and some to a greater depth. If the site is huge and its content is hidden many clicks away from the homepage, the crawler bots may actually give up. There are ways to ask search engines NOT to index a site, though this is rarely used to block an entire website.
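One widely used way to make that request is a `robots.txt` file, which tells well-behaved crawlers which paths to stay out of (strictly speaking it governs crawling; a page-level `noindex` meta tag is the other common mechanism). As a small sketch, Python's standard `urllib.robotparser` can check a URL against such rules. The rules, the `ExampleBot` name and the URLs below are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one directory for every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a made-up crawler name.
allowed_public = rules.can_fetch("ExampleBot", "https://example.com/index.html")
allowed_private = rules.can_fetch("ExampleBot", "https://example.com/private/notes.html")

print(allowed_public, allowed_private)
```

A polite crawler would run a check like this before requesting each page and simply skip anything disallowed.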
There was even a time when large parts of the Internet were essentially invisible to search engines – the so-called “deep web” – but this is rare now. Tor-hosted websites, for example, remain unindexed by Google, and are only accessible by connecting to the Tor network and knowing the address.
You’d be forgiven for thinking indexing is an easy step: it is the process of taking all of the data you have from a crawl, and placing it in a big database. Imagine trying to make a list of all the books you own, their author and the number of pages. Going through each book is the crawl and writing the list is the index. But now imagine it’s not just a room full of books, but every library in the world. That’s pretty much a small-scale version of what Google does.
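A common data structure for this kind of list is an inverted index: instead of storing "document, then the words in it", it stores "word, then the documents containing it". The sketch below is a toy version with three invented documents; the article doesn't say how any real search engine stores its index, so treat this purely as an illustration of the idea.

```python
import re
from collections import defaultdict

# Hypothetical crawled documents: id -> text.
DOCS = {
    1: "The quick brown fox",
    2: "The lazy dog",
    3: "The quick dog jumps",
}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


index = build_index(DOCS)
print(sorted(index["quick"]))
```

Looking up a word is now a single dictionary access rather than a scan through every document, which is the whole point of building the index up front.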
All of this data is stored in vast data centres with thousands of petabytes’ worth of drives.
Ranking & Retrieval
The last step is what you see – you type in a search query, and the search engine attempts to display the most relevant documents it finds that match your query. This is the most complicated step, but also the most relevant to you and me, as web developers and users. It is also the area in which search engines differentiate themselves (though there was some evidence that Bing was actually copying some Google results). Some work with keywords, some allow you to ask a question, and some include advanced features like keyword proximity or filtering by age of content.
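As a rough illustration of keyword matching and keyword proximity, the sketch below records the position of every word in each document, finds the documents that contain all the query words, and then measures how close two words sit to each other. The documents and the function names (`search`, `min_distance`) are invented for the example, and real engines do far more than this.

```python
import re
from collections import defaultdict

# Hypothetical indexed pages.
DOCS = {
    "page1": "cheap flights to paris and cheap hotels",
    "page2": "paris travel guide with hotels and museums and flights",
    "page3": "flights from london",
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positional_index(docs):
    """Map word -> document id -> list of positions where the word occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(tokenize(text)):
            index[word][doc_id].append(position)
    return index


def search(index, query):
    """Return the ids of documents containing every word of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    doc_sets = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*doc_sets))


def min_distance(index, doc_id, term_a, term_b):
    """Smallest gap, in words, between two terms in one document (or None)."""
    pos_a = index.get(term_a, {}).get(doc_id, [])
    pos_b = index.get(term_b, {}).get(doc_id, [])
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)


index = build_positional_index(DOCS)
matches = search(index, "flights paris")
# Keep only pages where the two words are at most 3 words apart.
close_matches = [d for d in matches if min_distance(index, d, "flights", "paris") <= 3]
print(matches, close_matches)
```

Both pages mention flights and Paris, but only one mentions them close together, which is a hint that it is actually about flights to Paris.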
The ranking algorithm checks your search query against billions of pages to determine how relevant each one is. This operation is so complex that companies closely guard their own ranking algorithms as trade secrets. Why? Competitive advantage for a start – so long as they are giving you the best search results, they can stay on top of the market. Secondly, it prevents people from gaming the system to give one site an unfair advantage over another.
Once the internal methodology of any system is fully understood, there will always be those who try to “hack” it – discover the ranking factors and exploit them for monetary gain.
Exploiting the ranking algorithm has in fact been commonplace since search engines began, but in the last 3 years or so Google has really made that difficult. Originally, sites were ranked based on how many times a particular keyword was mentioned. This led to “keyword stuffing”, where pages are filled with mostly nonsense, so long as the keyword appears everywhere.
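It is easy to see why counting keywords invites abuse. The sketch below ranks two invented pages purely by how often a keyword appears, which is a deliberately naive stand-in for those early ranking schemes and not a description of any particular engine.

```python
import re


def keyword_count(text, keyword):
    """Count how many times `keyword` appears as a whole word in `text`."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return words.count(keyword.lower())


def rank_by_count(pages, keyword):
    """Order page names from most to fewest mentions of the keyword."""
    return sorted(
        pages,
        key=lambda name: keyword_count(pages[name], keyword),
        reverse=True,
    )


# Hypothetical pages: one genuine, one stuffed with the keyword.
PAGES = {
    "honest": "Our bakery sells fresh bread and cakes baked every morning.",
    "stuffed": "bread bread cheap bread best bread buy bread bread online",
}

ranking = rank_by_count(PAGES, "bread")
print(ranking)
```

The stuffed page wins comfortably despite saying nothing useful, which is exactly the weakness that pushed search engines towards other signals.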
Then the concept of importance based on linking was introduced – more popular sites would obviously be linked to more often – but this led to a proliferation of spammed links all over the web. Now each link is determined to have a different value, depending on the “authority” of the site in question. If a high-level government agency links to you, it’s worth far more than a link found in a free-for-all “link directory”.
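One way to picture link "authority" is a simplified PageRank-style calculation, where each page repeatedly shares its score among the pages it links to. The article doesn't name the algorithm any search engine actually uses, so the sketch below is only an illustration: the link graph, the page names and the damping value of 0.85 are all assumptions.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Simplified PageRank-style scores for a dict of page -> outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score among the pages it links to.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * scores[page] / n
                for other in pages:
                    new_scores[other] += share
        scores = new_scores
    return scores


# Hypothetical web: "agency" is well linked; "directory" links to everything.
LINKS = {
    "news1": ["agency"],
    "news2": ["agency"],
    "news3": ["agency"],
    "agency": ["site_a"],
    "directory": ["site_b", "spam1", "spam2", "spam3"],
}

scores = link_scores(LINKS)
print(round(scores["site_a"], 3), round(scores["site_b"], 3))
```

Both `site_a` and `site_b` have exactly one incoming link, yet `site_a` ends up with the higher score because its link comes from a page that is itself well linked and links to little else.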