Lauren's Blog

stop and smell the roses

Portfolio Assignment 7 March 24, 2009

Filed under: Data Mining — Lauren @ 6:02 pm

PCI Chapter 4 – Searching and Ranking

Everyone knows the most popular search engine in the world is Google, thanks in large part to the PageRank algorithm. Chapter 4 builds a search engine by collecting documents with a crawler, indexing the locations of different words, and finally ranking pages to return to the user as a list. Google and other search engines are so fast that it's hard to imagine all the work that goes into a single query. Until now, I've never really thought about what happens behind the scenes; I just Google it!

A Simple Crawler

The built-in page downloader library, urllib2, was easy to see in action: it downloads an HTML page, and you can print out characters from different locations and ranges within it. urllib2 used in combination with BeautifulSoup will parse even poorly written HTML and XML documents. I had to download and install BeautifulSoup, and to do that I had to download WinZip so I could extract the tar.gz file BeautifulSoup comes packaged in. I put the BeautifulSoup.py file in my Python directory and got it working.
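As a rough sketch of what that parsing step does — using only the standard library's html.parser instead of BeautifulSoup (since installing BeautifulSoup was the painful part), and a hard-coded HTML snippet in place of a real download — pulling the links out of a page might look like:

```python
from html.parser import HTMLParser

# The book's code uses urllib2 + BeautifulSoup; this stdlib-only sketch shows
# the same idea: collect the href of every <a> tag in a page's HTML.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

# A tiny stand-in for a downloaded page (a real crawler would fetch
# this with something like urllib.request.urlopen(url).read()).
html = '<html><body><a href="/about">About</a> <a href="/news">News</a></body></html>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/about', '/news']
```

This tolerates sloppy markup far less gracefully than BeautifulSoup does, which is exactly why the book reaches for the extra library.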

The idea of “crawling” through the Internet completely baffles me. The Internet is such a huge and ever expanding entity that sometimes I think Google is magic. However, after adding the crawl method to searchengine.py, I have a greater understanding of how it is done. It is a pretty clever algorithm! Instead of testing the crawl method on kiwitobes, I used the real Wikipedia entry for Perl:

>>> pagelist = ['http://en.wikipedia.org/wiki/Perl']
>>> crawler = searchengine.crawler('')
>>> crawler.crawl(pagelist)
Could not open http://en.wikipedia.org/wiki/Perl

Hmm, OK let’s try something different, like the UMW homepage. Now we’re talking!

>>> pagelist = ['http://www.umw.edu']
>>> crawler = searchengine.crawler('')
>>> crawler.crawl(pagelist)

Indexing http://www.umw.edu

Indexing http://www.umw.edu/about/administration

Indexing http://www.umw.edu/featuredfaculty/stull

Indexing http://www.umw.edu/news

Indexing http://www.umw.edu/news/?a=1648

Indexing http://strategicplanning.umwblogs.org

Indexing http://www.umw.edu/academics

Indexing http://www.umw.edu/athletics

Indexing http://www.umw.edu/events

Indexing http://www.umw.edu/about

Indexing http://umwblogs.org

Indexing http://www.umw.edu/azindex

It still ends in an error… coming back to this later.
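The crawl loop itself is what demystified things for me: it's just breadth-first search over pages. A minimal sketch of that idea, with a made-up fake_web dict and get_links function standing in for the real urllib2 + BeautifulSoup fetching so it runs without a network:

```python
# Stand-in for the web: each URL maps to the links found on that page.
fake_web = {
    'http://www.umw.edu': ['http://www.umw.edu/news', 'http://www.umw.edu/about'],
    'http://www.umw.edu/news': ['http://www.umw.edu'],
    'http://www.umw.edu/about': [],
}

def get_links(url):
    # The real crawler downloads the page and parses out <a href=...> tags.
    return fake_web.get(url, [])

def crawl(pagelist, depth=2):
    """Breadth-first crawl: index each page in the current list, gather the
    links it points to, and use those links as the next round's page list."""
    indexed = set()
    for _ in range(depth):
        newpages = set()
        for page in pagelist:
            if page in indexed:
                continue          # skip pages we've already seen
            print('Indexing %s' % page)
            indexed.add(page)     # the book's addtoindex stores words here
            newpages.update(get_links(page))
        pagelist = newpages
    return indexed

crawl(['http://www.umw.edu'])
```

The depth parameter keeps the loop from wandering off across the whole web, which is also why the book's crawler stops after a couple of rounds.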

I installed pysqlite for Python 2.6 and got it working smoothly. I like how you use Python strings to execute SQL queries to manipulate the database! It took me a while to get the database schema set up because I made a lot of typos, and it kept giving me errors like “table urllist already exists”. So I renamed the database and got it working. The crawler class is complete and ready to test, if I can find a website that will work for it!
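For anyone hitting the same “table urllist already exists” error: adding IF NOT EXISTS to the schema statements makes the setup safe to rerun. A sketch of the chapter's five tables and indexes, using sqlite3 (pysqlite's name in the standard library) with an in-memory database instead of the book's file:

```python
import sqlite3

# In-memory database for the sketch; the book connects to a file
# such as 'searchindex.db' instead.
con = sqlite3.connect(':memory:')

# The chapter's schema. IF NOT EXISTS means rerunning the script
# no longer raises "table urllist already exists".
con.execute('create table if not exists urllist(url)')
con.execute('create table if not exists wordlist(word)')
con.execute('create table if not exists wordlocation(urlid, wordid, location)')
con.execute('create table if not exists link(fromid integer, toid integer)')
con.execute('create table if not exists linkwords(wordid, linkid)')
con.execute('create index if not exists wordidx on wordlist(word)')
con.execute('create index if not exists urlidx on urllist(url)')
con.commit()

# A second run of the same statement is now harmless:
con.execute('create table if not exists urllist(url)')
```

That way there's no need to rename the database every time the schema script is rerun.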
