Friday, April 08, 2005

The hamsap search engine

It appears that my adventure with hamsap words (see previous post) is not paying off yet. All visitors tracked by my counter appear to be legitimate blog-crawlers. Has Google updated my page in its cache yet?

Time for a little investigation. The first step would be to see if my blog address and some kinky words will elicit a hit from Google. The search strings tanyeewei naked and tanyeewei boobs came up with null results. Believe me, I cringed and squirmed uncomfortably when I typed those in. Nevertheless, I was relieved to see no results. I cannot imagine my horror if those incriminating images of me naked with F-cup boobs were available on the net.

Clearly, the Google spiders have not crawled by and dropped in at my page for some days. By searching my name with entry-specific words on Google, it appears that the last time any of those adorable arachnids stopped by for tea, scones and information was on the 29th of March. If one is inclined to update a webpage on Google, it can be done by inviting the spiders over for an information crawl from here. The entry would be saved in a list of target pages for the crawlers to search.

I’ll also share a little of what I’ve learned about search engines (especially Google) over the last 100 minutes or so.

It is estimated that tens of millions of searches are performed every day. (How Stuff Works) Looking at the numbers, it would not be very efficient to search the live web for specific objects tens of millions of times each day, as and when users demand it. Instead, major parts of the web have already been browsed by the search engine and part of the content copied into a database. That way, the search engine only needs to search its own database to point you in the right direction. (Try searching some random words on Google. Most search hits are accompanied on the bottom line by a little link named “cached”. You can click on this to see the copy of the page that was saved by the search engine crawler.)
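To make the "search its own database" idea concrete, here is a minimal sketch of an inverted index, which maps each word to the pages containing it. This is only an illustration under my own assumptions: the page names, the sample text and the functions `build_index` and `search` are all made up, and a real search engine stores far more than this.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of page names that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

def search(index, query):
    """Return the pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start with the pages for the first word, then narrow down.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical pages that a crawler has already copied.
pages = {
    "a.html": "hamsters eat seeds",
    "b.html": "spiders eat flies",
    "c.html": "spiders crawl the web",
}
index = build_index(pages)
print(search(index, "spiders eat"))
```

The slow work (reading every page) happens once when the index is built. Each query afterwards is just a few set lookups, which is why answering millions of searches a day becomes feasible.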

In their never-ending quest to map the perpetually changing web space, search engines employ a horde of autonomous web-surfing programs to index and archive web pages. Given the cute way the web is named (the World Wide Web), these programs are naturally called spiders or crawlers.
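A crawler's basic loop can be sketched as follows: visit a page, save its content, then follow its links to pages not yet seen. To keep the example runnable without a network connection, it crawls a hypothetical in-memory "web" (`TOY_WEB`, invented for this post) rather than real URLs. Real spiders also have to deal with politeness rules, revisit schedules and much more.

```python
from collections import deque

# Hypothetical toy web: page name -> (page text, outgoing links).
TOY_WEB = {
    "home": ("welcome to my blog", ["about", "post1"]),
    "about": ("about the author", ["home"]),
    "post1": ("search engines and spiders", ["post2", "home"]),
    "post2": ("more on spiders", []),
    "orphan": ("nobody links here", []),
}

def crawl(web, start):
    """Breadth-first crawl from `start`; return {page: saved text}."""
    seen = {start}
    queue = deque([start])
    archive = {}
    while queue:
        page = queue.popleft()
        text, links = web[page]
        archive[page] = text  # the "cached" copy
        for link in links:
            # Skip dead links and pages already queued or visited.
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return archive

archive = crawl(TOY_WEB, "home")
print(list(archive))
```

Notice that the page named "orphan" never gets archived, because no crawled page links to it. A spider can only find what it can reach.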

To rank the relevance of search results, various weightings (or scores) are used. The font size of a word, its frequency in the document, its appearance in the title, its appearance in linked pages and many other factors contribute to the overall weight. A page linked to by many high-scoring pages (say, BBC) will be deemed more important than a page with few incoming links (this blog page, for example).
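The link-based part of the scoring can be sketched with a simple iterative calculation in which every page repeatedly passes a share of its score to the pages it links to. This is a simplified illustration in the spirit of link analysis, not Google's actual formula. The function `link_scores`, the damping value of 0.85, the iteration count and the three-page graph are all my own assumptions.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Score pages by incoming links.

    `links` maps each page to the list of pages it links to.
    Every link target must also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    scores = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline score.
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            # A page with no links spreads its score over all pages.
            targets = outgoing if outgoing else pages
            share = damping * scores[page] / len(targets)
            for target in targets:
                new[target] += share
        scores = new
    return scores

# Hypothetical graph: "blog" links out but nobody links back to it.
links = {
    "bbc": ["news"],
    "news": ["bbc"],
    "blog": ["bbc"],
}
scores = link_scores(links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy graph "bbc" comes out on top because two pages link to it, while "blog" is left with only the baseline score, much like this blog page.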

The importance of mathematics and analysis cannot be overstressed here. For an idea of how Google Labs lures potential employees, look here.



References:
Google: About Google, Google Labs, Google
Curt Franklin: How Internet Search Engines Work
Ed Pegg Jr, Eric W. Weisstein et al.: Mathematica's Google Aptitude