Friday, October 06, 2006

Spiders crawling all over the web...


JavaWorld.com has a very interesting article on creating web spiders. The article is pretty technical, but after parsing through it there is a lot of useful information to be gleaned. Using a web spider is kind of like using Google. However, as I understand it, the information returned by a web spider follows a directed path of links starting from the "root": whatever website you choose to have the spider start following links from.
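To make the "following links from the root" idea concrete, here is a minimal sketch of the first step any spider needs: pulling the links out of a page. This is my own illustration in Python using only the standard library, not code from the JavaWorld article; the names `LinkCollector` and `extract_links` are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute ones so they can be followed.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Hypothetical helper: return the absolute links found in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

A spider would fetch the root page, call something like `extract_links` on it, and then repeat the process for each link it gets back.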

At the bottom of the article there is a downloadable Demo program of a spider. Very fun program to play around with. Once I got used to the advantages and disadvantages of the various breadth and depth settings, I was able to return some interesting results.

For instance, if you set the maximum search depth to "100", this particular program will follow each link until it has travelled 100 links from the root. At that point it will start the "breadth" field of the search, which involves travelling along each of the links found on each website, until it can travel no farther.

So, it seems as though the Demo Spider has two modes. First, in depth mode, it travels along the first link it encounters until it cannot travel any farther or it reaches the maximum number of sites to travel along. Following this, it "backtracks" to each site and explores any other links available, until there are no more or it reaches the maximum depth specified.
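The behavior described above looks a lot like a depth-first search with two limits. Here is a rough sketch of that idea in Python. It is only my reading of how the demo seems to behave, not its actual code. To keep it runnable without a network, the "web" is just a dictionary mapping each page to the links on it, and the names `crawl_depth_first`, `max_depth`, and `max_pages` are my own.

```python
def crawl_depth_first(links, root, max_depth, max_pages):
    """Visit pages depth-first from root, honoring a depth and a page limit.

    links: dict mapping a page name to the list of pages it links to
           (a stand-in for fetching real pages).
    Returns the pages in the order they were visited.
    """
    visited = []   # visit order
    seen = set()   # fast membership check

    def visit(page, depth):
        if page in seen or len(visited) >= max_pages:
            return
        seen.add(page)
        visited.append(page)
        if depth >= max_depth:
            return  # too far from the root: do not follow links from here
        for target in links.get(page, []):
            # Follow the first link all the way down before this loop
            # "backtracks" to the next link on the same page.
            visit(target, depth + 1)

    visit(root, 0)
    return visited


# A tiny made-up site for illustration.
site = {"root": ["a", "b"], "a": ["c"], "c": ["d"], "b": ["e"]}
order = crawl_depth_first(site, "root", max_depth=2, max_pages=10)
```

With a maximum depth of 2, the sketch goes root, a, c, then backtracks to b and e; page d is never reached because it sits three links from the root.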

As a tool, this strikes me as extremely useful, since it searches the entire website for you. Rather than trying to search through a site map for specific content and links on a website, you can make the spider do it for you! If you wanted to create a database of links on a particular subject, this program performs an exhaustive search. Granted, you have to go through the material yourself. However, there is an added tool in the program to make that task much easier.

You can enter keywords for it to pay attention to, so the program will highlight any webpages that match the criteria that you specify. This is not like a keyword search on Google, as it does not isolate the sites that return the keyword. However, it is useful for "seeing" the shape of the net.
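The difference between highlighting and filtering can be shown with a small sketch. Again, this is my own illustration in Python, not the demo's code: the function name `flag_pages` and the sample pages are invented. Every crawled page stays in the report, and pages that mention a keyword are simply marked.

```python
def flag_pages(pages, keywords):
    """Return (url, matched_keywords) for every page, matching or not.

    pages: dict mapping a URL to the text of that page.
    Matching is a simple case-insensitive substring check.
    """
    report = []
    for url, text in pages.items():
        body = text.lower()
        matched = [k for k in keywords if k.lower() in body]
        # Pages with no match are kept, so the overall shape of the crawl
        # stays visible; they just carry an empty list.
        report.append((url, matched))
    return report


# Made-up crawl results for illustration.
crawled = {
    "http://example.com/": "Spiders and how they crawl",
    "http://example.com/about": "Nothing relevant here",
}
report = flag_pages(crawled, ["spider", "crawl"])
```

A search engine would return only the first page; the sketch returns both and marks the first one as a match.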
