Sunday, June 13, 2010

Caffeine: The New Search Indexing System of Google

Anyone who has worked on or read about search engines knows that indexing is a tedious job and plays a central role. At the research level we generally use libraries like Beautiful Soup to parse pages before indexing them. But pages that are constantly updated need faster indexing, so search engines usually identify them. Web pages such as news sites, stock market pages, etc. change constantly, so they are crawled more often than other pages.
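To make the idea of crawling some pages more often concrete, here is a minimal sketch of a recrawl scheduler. It is only an illustration, not how any real search engine schedules its crawler: the class name `RecrawlScheduler`, the example URLs, and the fixed per-page intervals are all assumptions made up for this post.

```python
import heapq


class RecrawlScheduler:
    """Toy scheduler (hypothetical): each page is revisited at its own interval."""

    def __init__(self):
        self._queue = []     # min-heap of (next_due_time, url)
        self._interval = {}  # url -> seconds between visits

    def add_page(self, url, interval_seconds, now=0.0):
        if interval_seconds <= 0:
            raise ValueError("interval_seconds must be positive")
        self._interval[url] = interval_seconds
        heapq.heappush(self._queue, (now, url))  # first visit is due immediately

    def due_pages(self, now):
        """Return every page whose visit is due, and reschedule each one."""
        due = []
        while self._queue and self._queue[0][0] <= now:
            _, url = heapq.heappop(self._queue)
            due.append(url)
            heapq.heappush(self._queue, (now + self._interval[url], url))
        return due


scheduler = RecrawlScheduler()
scheduler.add_page("news.example/markets", interval_seconds=60)     # changes constantly
scheduler.add_page("blog.example/about", interval_seconds=86400)    # rarely changes
first = scheduler.due_pages(now=0)     # both pages are visited once
second = scheduler.due_pages(now=60)   # only the news page is due again
print(second)
```

A page with a short interval keeps coming back to the front of the queue, while a rarely changing page waits its turn.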
      The introduction of Caffeine brings a whole different approach and promises to keep all web pages up to date. It takes a parallel processing approach, and hundreds of thousands of pages are processed every second. This leads to fresher query results. According to Google, "Caffeine takes up nearly 100 million gigabytes of storage in one database and adds new information at a rate of hundreds of thousands of gigabytes per day."
Google describes the difference as follows:

"Our old index had several layers, some of which were refreshed at a faster rate than others; the main layer would update every couple of weeks. To refresh a layer of the old index, we would analyze the entire web, which meant there was a significant delay between when we found a page and made it available to you.

With Caffeine, we analyze the web in small portions and update our search index on a continuous basis, globally. As we find new pages, or new information on existing pages, we can add these straight to the index. That means you can find fresher information than ever before, no matter when or where it was published."
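The contrast between refreshing a whole layer and updating continuously can be sketched with a toy inverted index. This is only an illustration of the general idea, not Caffeine's actual design: the names `IncrementalIndex` and `rebuild_index`, the whitespace tokenizer, and the example URLs are assumptions made for this post.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index (hypothetical) that is updated one page at a time."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of URLs containing it
        self.page_terms = {}              # URL -> terms currently indexed for it

    def add_or_update(self, url, text):
        """Index a new page, or re-index a changed one, without touching other pages."""
        new_terms = set(text.lower().split())
        old_terms = self.page_terms.get(url, set())
        for term in old_terms - new_terms:   # terms that disappeared from the page
            self.postings[term].discard(url)
            if not self.postings[term]:
                del self.postings[term]
        for term in new_terms - old_terms:   # terms that are new on the page
            self.postings[term].add(url)
        self.page_terms[url] = new_terms

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


def rebuild_index(pages):
    """Batch approach: analyze every page again to produce a fresh index."""
    index = IncrementalIndex()
    for url, text in pages.items():
        index.add_or_update(url, text)
    return index


# Batch build once, then apply a single-page update as soon as the change is found.
index = rebuild_index({
    "a.example/news": "markets fall",
    "b.example/blog": "coffee notes",
})
index.add_or_update("a.example/news", "markets rally")
print(index.search("rally"))  # ['a.example/news']
print(index.search("fall"))   # []
```

With the batch approach the changed page would only become searchable after the next full rebuild; with the per-page update it is searchable right away, and the cost depends on that one page rather than on the size of the whole collection.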
The image compares the old search index with the new one, Caffeine.
