Lauren's Blog

stop and smell the roses

Portfolio Assignment 7 March 24, 2009

Filed under: Data Mining — Lauren @ 6:02 pm

PCI Chapter 4 – Searching and Ranking

Everyone knows the most popular search engine in the world is Google, thanks in part to the PageRank algorithm. Chapter 4 creates a search engine by collecting documents with a crawler, indexing the locations of different words, and finally ranking pages to return to a user as a list. Google and other search engines are so fast, it’s hard to think of all the work that goes into a query. Until now, I’ve never really thought about what happens behind the scenes; I just Google it!
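The ranking step was easiest for me to picture as the PageRank iteration itself: each page gets a base score plus a share of the score of every page linking to it. Here is a minimal sketch on a made-up three-page link graph (the pages, links, and function name are just my example, not from the book):

```python
# Minimal PageRank sketch on a tiny made-up link graph.
# Each page's score is 0.15 plus 0.85 times the score flowing
# in from pages that link to it, repeated until it settles.

def pagerank(links, iterations=20, damping=0.85):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    pr = {page: 1.0 for page in pages}   # start every page equal
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = 0.0
            for other, outlinks in links.items():
                if page in outlinks:
                    # each page splits its score among its outlinks
                    incoming += pr[other] / len(outlinks)
            new_pr[page] = (1 - damping) + damping * incoming
        pr = new_pr
    return pr

web = {
    'A': ['B', 'C'],   # A links to B and C
    'B': ['C'],        # B links only to C
    'C': ['A'],        # C links back to A
}
ranks = pagerank(web)
# C collects links from both A and B, so it should rank highest.
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page C ends up on top because both other pages point at it, which is the whole intuition: links are votes, and votes from important pages count more.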

A Simple Crawler

The built-in page downloader library, urllib2, was easy to see in action. It downloads an HTML page and can print out characters at different locations and ranges throughout the page. urllib2, used in combination with BeautifulSoup, will parse HTML and XML documents even when they are poorly written. I had to download and install BeautifulSoup, but to do that, I first had to download WinZip so I could extract the tar.gz file that the BeautifulSoup download comes in. I put the BeautifulSoup.py file in my Python directory and got it working.
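Since BeautifulSoup was the fiddly part of the install, it's worth noting the link-pulling step it does for the crawler can be sketched with nothing but the standard library. This uses Python 3's html.parser as a stand-in for BeautifulSoup, and the HTML string below is a stand-in for a page fetched with urllib2:

```python
from html.parser import HTMLParser  # stdlib stand-in for BeautifulSoup

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, like soup('a') does in the book."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

# A stand-in for a downloaded page:
page = '<html><body><a href="/about">About</a> <a href="/news">News</a></body></html>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['/about', '/news']
```

The real BeautifulSoup earns its keep on the badly broken HTML out on the web, which html.parser is much less forgiving about, but the idea is the same: walk the tags, keep the hrefs.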

The idea of “crawling” through the Internet completely baffles me. The Internet is such a huge and ever-expanding entity that sometimes I think Google is magic. However, after adding the crawl method to searchengine.py, I have a greater understanding of how it is done. It is a pretty clever algorithm! Instead of testing the crawl method on kiwitobes, I used the real Wikipedia entry for Perl:

>>> pagelist = ['http://en.wikipedia.org/wiki/Perl']
>>> crawler = searchengine.crawler('')
>>> crawler.crawl(pagelist)
Could not open http://en.wikipedia.org/wiki/Perl

Hmm, OK let’s try something different, like the UMW homepage. Now we’re talking!

>>> pagelist = ['http://www.umw.edu']
>>> crawler = searchengine.crawler('')
>>> crawler.crawl(pagelist)
Indexing http://www.umw.edu
Indexing http://www.umw.edu/about/administration
Indexing http://www.umw.edu/featuredfaculty/stull
Indexing http://www.umw.edu/news
Indexing http://www.umw.edu/news/?a=1648
Indexing http://strategicplanning.umwblogs.org
Indexing http://www.umw.edu/academics
Indexing http://www.umw.edu/athletics
Indexing http://www.umw.edu/events
Indexing http://www.umw.edu/about
Indexing http://umwblogs.org
Indexing http://www.umw.edu/azindex

It still ends in an error… coming back to this later.
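While I sort out the error on real sites, it helped me to see the shape of the crawl loop on its own. This is my sketch over a fake in-memory “web” (a dict of page to links) instead of live HTTP, so the breadth-first, depth-limited logic is visible without any network problems getting in the way:

```python
from collections import deque

# A fake in-memory web: page -> pages it links to (made-up URLs).
fake_web = {
    'umw.edu':              ['umw.edu/about', 'umw.edu/news'],
    'umw.edu/about':        ['umw.edu'],
    'umw.edu/news':         ['umw.edu/news/?a=1648'],
    'umw.edu/news/?a=1648': [],
}

def crawl(seed_pages, depth=2):
    """Breadth-first crawl: index each page once, follow links level by level."""
    indexed = set()
    frontier = deque(seed_pages)
    for _ in range(depth):
        next_frontier = deque()
        while frontier:
            page = frontier.popleft()
            if page in indexed or page not in fake_web:
                continue  # already seen, or "Could not open" this page
            print('Indexing', page)
            indexed.add(page)
            next_frontier.extend(fake_web[page])
        frontier = next_frontier
    return indexed

pages = crawl(['umw.edu'])
```

With depth=2 it indexes the seed page plus everything one link away, and the `indexed` set is what keeps the crawler from looping forever when pages link back to each other.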

I installed pysqlite for Python 2.6 and got it working smoothly. I like how you use Python strings to execute SQL queries to manipulate the database! It took me a while to get the database schema set up because there were a lot of typos and it kept giving me errors like “table urllist already exists”. So I renamed the database and got it working. The crawler class is complete and ready to test, if I can find a website that will work for it!
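The “table urllist already exists” error also goes away if the schema statements use CREATE TABLE IF NOT EXISTS, which makes the setup safe to re-run against an old database instead of renaming it. A sketch with the sqlite3 module (the table names follow the book's schema; the in-memory database is just for the example):

```python
import sqlite3

con = sqlite3.connect(':memory:')  # throwaway database for the example

# IF NOT EXISTS makes the schema setup safe to run more than once:
con.execute('create table if not exists urllist(url)')
con.execute('create table if not exists wordlist(word)')
con.execute('create table if not exists wordlocation(urlid, wordid, location)')
con.execute('create index if not exists urlidx on urllist(url)')

# Running the same statement again no longer raises
# "table urllist already exists":
con.execute('create table if not exists urllist(url)')

con.execute("insert into urllist(url) values ('http://www.umw.edu')")
rows = con.execute('select url from urllist').fetchall()
print(rows)  # [('http://www.umw.edu',)]
```

The same queries-as-strings style the book uses with pysqlite works unchanged here, since sqlite3 is the same interface built into later Pythons.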


Portfolio Assignment 6

Filed under: Data Mining — Lauren @ 5:57 pm

Clustering Movies

For this assignment, my team (Andrew, Kurt, Will) and I tried to cluster a very large file with movie data. Once we got the text file to work in the readfile method, we ran it on my computer, waited, waited and waited. We knew ahead of time that it would take a while to cluster so we ran it during class (approximately 2.5 hours) and still nothing. 

Not knowing what to do next, I browsed my classmates’ blogs to see how they approached this movie data. The idea I tried next was to get rid of most of the columns, keeping only two, and to narrow the data down to 1,000 movies. I let it run for 5 minutes or so and finally got the Python command prompt back! But then I kept getting this error when I tried to print the clusters out:

>>> movienames,categories,data=moviecluster.readfile('moviedata.txt')
>>> clust = moviecluster.hcluster(data)
>>> moviecluster.printclust(clust,labels=movienames)
  Starman
Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    moviecluster.printclust(clust,labels=movienames)
  File "C:\Python26\moviecluster.py", line 101, in printclust
    if clust.right!=None: printclust(clust.right,labels=labels,n=n+1)
  File "C:\Python26\moviecluster.py", line 91, in printclust
    if clust.id<0:
AttributeError: 'list' object has no attribute 'id'
>>>
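Reading the traceback, printclust recursed into something that is a plain Python list rather than a cluster node with an .id attribute, so my guess is that somewhere a raw data row ended up in the tree instead of a bicluster (maybe a side effect of how I cut the columns down). Here is a minimal sketch of the node structure printclust expects, assuming the book's bicluster shape, with a tiny hand-built tree in place of real movie data:

```python
class Bicluster:
    """Minimal stand-in for the book's bicluster node."""
    def __init__(self, vec, left=None, right=None, id=None):
        self.vec = vec      # the row of numbers for this node
        self.left = left    # child Bicluster, or None for a leaf
        self.right = right
        self.id = id        # >= 0 for a real row, < 0 for a merged branch

def printclust(clust, labels=None, n=0):
    print(' ' * n, end='')
    if clust.id < 0:
        print('-')          # negative ids mark merged branches
    else:
        print(labels[clust.id] if labels else clust.id)
    if clust.left is not None:
        printclust(clust.left, labels=labels, n=n + 1)
    if clust.right is not None:
        printclust(clust.right, labels=labels, n=n + 1)

# Two leaf movies merged into one branch node (id=-1):
leaves = [Bicluster([1, 0], id=0), Bicluster([0, 1], id=1)]
tree = Bicluster([0.5, 0.5], left=leaves[0], right=leaves[1], id=-1)
printclust(tree, labels=['Starman', 'Alien'])
```

Every node in the tree has to be a cluster object like this; if .left or .right is ever a bare list, the recursion dies with exactly the AttributeError in my session above. So the fix is probably in how readfile builds the data rows, not in printclust itself.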

 

Portfolio Assignment 5 March 15, 2009

Filed under: Data Mining — Lauren @ 5:49 pm

Visualizations

Hans Rosling’s talk at the TED conference was so cool. The Baby Name Wizard that Professor Zacharski showed in class caught my attention. I have shown it to all my friends and typed in all of their names, which is fun. As noted in the assignment description, some visualizations are purely artistic, like this Antarctic Animation, which looks cool, but I have no idea what data it is trying to analyze!

I found an interesting baseball visualization that analyzes spending and performance. Towards the end of the 2008 season, about a month before the World Series, the number one ranked team, the Angels, was spending a fair amount for their performance. However, the Rays, who were ranked third behind the Angels, were spending considerably less, which counteracts the idea that the best teams spend the most. The good ole Washington Nats did terribly last season :( but also did not spend nearly as much as the Angels or other expensive teams like Boston and New York.

There are all kinds of visualizations listed on this blog by Meryl K. Evans.

I created a visualization on Many Eyes by uploading a dataset for the price of gas. It only goes to 2004 so I’d like to find a more recent version. This is my practice visualization: Price of Gas 1976-2004.

As a member of the UMW Women’s Soccer team, I am particularly proud of this visualization. I took all the team articles from the athletics website and combined them into one large document. I also added every member of the 2008 team. I uploaded it to Many Eyes and got a cool advertisement and summary of the 2008 season! UMW Women’s Soccer 2008 Review. It’s a work in progress; I plan on editing the dataset to remove words that refer to other teams or anything else that would take away from the season and team in general. Go Eagles!