Lauren's Blog

stop and smell the roses

Portfolio Assignment 9 April 23, 2009

Filed under: Data Mining — Lauren @ 4:42 am

Final Project

For our final project, my team (Andrew, Kurt, Will) and I will be expanding on our work from last week with document filtering.

The Problem

As you well know, spam is a very annoying and persistent presence on the Internet. In chapter 6 of PCI, we learned that rule-based classifiers don't cut it because spammers are getting smarter. So, we created a learning classifier that is trained on data and assigns a document a category based on word (feature) probabilities. The only guideline for our project was to use a substantial dataset. The algorithm in the book uses short strings as "documents"; we want to use real email files to train the classifier and use it for future classifications.

The Data

At first, we searched the Internet for some fun spam datasets to download. Of course, there were a ton! But the way we planned to modify the book's classifying algorithm was to use raw email text, and the datasets we kept finding were in weird formats. So, Will logged into his old Yahoo email account and found 1,400 spam emails. I'm pretty sure if I logged into my old AOL account I would find a similar number! At first we thought we were going to have to use the sampletrain method from the book and type the name of every file into a line of code. That would take forever and wouldn't be very realistic. Instead, Will renamed all of his emails into the format spam#.txt or nonspam#.txt and whipped up openfiles, which reads a listing file and trains the classifier on each named file:

def openfiles(cl):
    # Each line of the listing file holds a file name and a 1/0 spam flag
    data = open('blogsplogreal.txt', 'r')
    lines = data.readlines()
    for i in lines:
        thisline = i.split(" ")
        filename = thisline[1]
        print 'opening: ' + filename
        if thisline[2] == "1\n":
            spamtype = 'spam'
        else:
            spamtype = 'not-spam'
        print 'file type: ' + spamtype
        cl.train(filename, spamtype)

The consistent naming scheme was useful because it let us write a loop that trains the classifier by concatenating the file's basename (spam or nonspam), the number, and '.txt':

def sampletrain(cl, basefile, numfiles, gory):
    for i in range(1, numfiles+1):
        filename = basefile + str(i) + '.txt'
        #print filename
        cl.train(filename, gory)
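
For illustration, here is roughly how we drive it (the file counts below are made up; the real run used Will's full set of emails):

# Hypothetical counts: train on files spam1.txt..spam200.txt and
# nonspam1.txt..nonspam200.txt, labelling each with its category.
sampletrain(cl, 'spam', 200, 'spam')
sampletrain(cl, 'nonspam', 200, 'not-spam')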

The Solution

We used Bayes (when in doubt, use Bayes!) to train and classify documents. Starting from the book's code, we had to edit the getwords method to open a file and add its words to the dictionary. A friendly neighbor in the lab showed us how to do file I/O, and this is what we came up with:

 

import re

def getwords(doc):
    # Read the whole file into one big string
    data = open(doc, 'r')
    lines = ''
    for line in data:
        lines += line
    #print lines
    # Split the words by non-alpha characters
    splitter = re.compile('\\W*')
    words = [s.lower() for s in splitter.split(lines) if len(s) > 2 and len(s) < 20]
    #print words
    # Return the unique set of words only
    return dict([(w, 1) for w in words])

This opens a file and concatenates each line into one big string. The string is then split into words and converted to lowercase, just as the book does. Now we can send the classifier a filename, it will extract the features, and the rest of the algorithm proceeds unchanged.

The Results

We were very excited to see that our modifications to train on files actually ran! It took some serious staring at the code to make sure we were doing it correctly, but now we really understand what is going on. Starting small, we trained two documents, one spam and one not. It correctly added the categories and features to the dictionary: success number 1! Then we trained a few more documents, gave it an unknown document to classify, and it worked! It classified all 4 unknown documents correctly. Once we knew it worked, we ran the algorithm on Will's mixed spam and nonspam files. Tomorrow we're going to run it on a combination of unknown documents and see how it classifies in front of the class.
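
For anyone following along with the book's naive Bayes classifier, the classification step looks something like this (the file name is hypothetical):

# Assuming cl came from the book's naive Bayes class with our file-based
# getwords, e.g. cl = docclass.naivebayes(getwords), and has already been
# trained on the spam/nonspam files above.
# default is returned if no category wins by enough of a margin.
print cl.classify('unknown1.txt', default='not-spam')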


Portfolio Assignment 8 April 11, 2009

Filed under: Data Mining — Lauren @ 8:53 pm

PCI Chapter 6 – Document Filtering

Next class, my team (Will, Andrew and Kurt) and I will be presenting Chapter 6 in PCI.  We have divided the chapter into major sections, and I will be discussing the first three, starting with Filtering Spam.

Filtering Spam

Why do we need to classify documents based on their contents? To eliminate spam! For my own Gmail account, I use many rule-based, spam-eliminating organizational methods. But, as the chapter points out, this isn't a perfect approach. Because I use word matching, sometimes an email that is destined for the “annoying UMW administration email” folder ends up in my inbox (where I promptly delete it). I also use email-address matching to filter my messages, but some addresses could go into more than one folder, depending on the content of the message. So, what to do? How about a program that learns from what you tell it is and isn't spam, and keeps learning not just initially, but as you receive more email.

Documents and Words

Some words appear more frequently in spam, so those words help determine whether a document is spam or not. There are also words that commonly show up in spam but could be just as important in an email that is not spam. getwords separates the document into words by splitting the stream whenever it encounters a character that is not a letter. This means that words with apostrophes are split apart. For example, they're becomes two words: they and re.
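
A quick check in the Python 2 interpreter shows the split that the book's regular expression does:

>>> import re
>>> splitter = re.compile('\\W*')     # the same splitter getwords uses
>>> splitter.split("they're")
['they', 're']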

Training the Classifier

As we know, the more examples of correctly classified documents a classifier sees, the better it becomes at classifying new documents. After adding the classifier class and its helper methods, I ran the code described on page 121 to check that it was working. Up until now, I didn't really understand what the class was doing with the features and categories. The example input helped me understand how this classifier will work.
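
If you want to poke at it yourself, a session along these lines (the training strings here are just stand-ins, not necessarily the book's exact examples) shows how features and categories get counted:

>>> import docclass
>>> cl = docclass.classifier(docclass.getwords)
>>> cl.train('the quick brown fox jumps', 'good')
>>> cl.train('make quick money at the online casino', 'bad')
>>> cl.fcount('quick', 'good')   # how often 'quick' appeared in 'good' documents
1.0
>>> cl.fcount('quick', 'bad')
1.0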

Calculating Probabilities

The probability that a word appears in a particular category tells us which words are strong indicators of spam. For example, if the word 'Viagra' appears far more often in the 'bad' category than in the 'good' category, it has a high probability of being a spam word. A word like 'the' is probably not a good spam indicator because it is so common:

>>> cl.fprob('the','good')
1.0
>>> cl.fprob('the','bad')
0.5

Because the classifier has only been trained on five documents, there are many words that appear in just one document. So whichever category that document belongs to, the word's probability in the other category will be 0, simply because it hasn't appeared there yet. This is not very reasonable, especially for common spam-ish words. Weighting the probability starts it at an assumed 50% and then moves it as more occurrences of the word appear.
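
In code, the weighted probability is just a blend of the assumed 0.5 and the observed frequency, along the lines of the book's weightedprob method (sketched here from memory, so treat the exact signature as approximate):

def weightedprob(self, f, cat, prf, weight=1.0, ap=0.5):
    # prf is the unweighted probability function, e.g. self.fprob
    basicprob = prf(f, cat)
    # Total number of times this feature has appeared across all categories
    totals = sum([self.fcount(f, c) for c in self.categories()])
    # Blend the assumed probability (ap) with the observed one,
    # weighted by how much evidence we have seen so far
    return ((weight * ap) + (totals * basicprob)) / (weight + totals)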

 

Portfolio Assignment 7 March 24, 2009

Filed under: Data Mining — Lauren @ 6:02 pm

PCI Chapter 4 – Searching and Ranking

Everyone knows the most popular search engine in the world is Google, thanks in part to the PageRank algorithm. This chapter builds a search engine by collecting documents through crawling, indexing the locations of different words, and finally ranking pages to return to the user as a list. Google and other search engines are so fast that it's hard to appreciate all the work that goes into answering a query. Until now, I've never really thought about what happens behind the scenes; I just Google it!

A Simple Crawler

The built-in page-downloading library, urllib2, was easy to see in action: it downloads an HTML page, and you can print out slices of the page at different locations and ranges. Combined with BeautifulSoup, urllib2 lets you parse HTML and XML documents even when they are poorly written. I had to download and install BeautifulSoup, and to do that I had to download WinZip so I could extract the tar.gz file that the BeautifulSoup download comes in. I put the BeautifulSoup.py file in my Python directory and got it working.

The idea of “crawling” through the Internet completely baffles me. The Internet is such a huge and ever-expanding entity that sometimes I think Google is magic. However, after typing the crawl method into searchengine.py, I have a greater understanding of how it is done. It is a pretty clever algorithm! (I've sketched the basic crawl loop below, after the output.) Instead of testing the crawl method on kiwitobes, I used the real Wikipedia entry for Perl:

>>> pagelist = ['http://en.wikipedia.org/wiki/Perl']
>>> crawler = searchengine.crawler('')
>>> crawler.crawl(pagelist)
Could not open http://en.wikipedia.org/wiki/Perl

Hmm, OK let’s try something different, like the UMW homepage. Now we’re talking!

>>> pagelist = ['http://www.umw.edu']
>>> crawler = searchengine.crawler('')
>>> crawler.crawl(pagelist)
Indexing http://www.umw.edu
Indexing http://www.umw.edu/about/administration
Indexing http://www.umw.edu/featuredfaculty/stull
Indexing http://www.umw.edu/news
Indexing http://www.umw.edu/news/?a=1648
Indexing http://strategicplanning.umwblogs.org
Indexing http://www.umw.edu/academics
Indexing http://www.umw.edu/athletics
Indexing http://www.umw.edu/events
Indexing http://www.umw.edu/about
Indexing http://umwblogs.org
Indexing http://www.umw.edu/azindex

It still ends in an error… coming back to this later.
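
As promised above, here is a simplified breadth-first crawl loop in the spirit of the book's crawler.crawl (the real version also indexes each page into the database; this sketch just follows links):

import urllib2
from BeautifulSoup import BeautifulSoup
from urlparse import urljoin

def simplecrawl(pages, depth=2):
    for i in range(depth):
        newpages = set()
        for page in pages:
            try:
                c = urllib2.urlopen(page)
            except:
                print "Could not open %s" % page
                continue
            soup = BeautifulSoup(c.read())
            print "Indexing %s" % page
            # Collect every outgoing link for the next round
            for link in soup('a'):
                if 'href' in dict(link.attrs):
                    url = urljoin(page, link['href'])
                    if url[0:4] == 'http':
                        newpages.add(url)
        pages = newpages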

I installed pysqlite for Python 2.6 and got it working smoothly. I like how you use Python strings to execute SQL queries against the database! It took me a while to get the database schema set up because of a bunch of typos, and it kept giving me errors like “table urllist already exists”. So I renamed the database and got it working. The crawler class is complete and ready to test, if I can find a website that will work for it!
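
For anyone setting this up, the schema-creation step looks roughly like this (a sketch using the sqlite3 module that ships with Python 2.6 rather than pysqlite; the table names follow the book's, but treat the details as approximate):

import sqlite3

# Connect to (or create) the index database and build the schema.
# Running the create statements twice is what produces the
# "table urllist already exists" error mentioned above.
con = sqlite3.connect('searchindex.db')
con.execute('create table urllist(url)')
con.execute('create table wordlist(word)')
con.execute('create table wordlocation(urlid, wordid, location)')
con.execute('create table link(fromid integer, toid integer)')
con.execute('create table linkwords(wordid, linkid)')
con.execute('create index urlidx on urllist(url)')
con.execute('create index wordidx on wordlist(word)')
con.commit()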

 

Portfolio Assignment 6

Filed under: Data Mining — Lauren @ 5:57 pm

Clustering Movies

For this assignment, my team (Andrew, Kurt, Will) and I tried to cluster a very large file with movie data. Once we got the text file to work in the readfile method, we ran it on my computer, waited, waited and waited. We knew ahead of time that it would take a while to cluster so we ran it during class (approximately 2.5 hours) and still nothing. 

Not knowing what to do next, I browsed my classmates' blogs to see how they approached this movie data. The idea I tried next was to get rid of all but two of the columns and narrow the data down to 1,000 movies. I let it run for 5 minutes or so and finally got the Python prompt back! But then I kept getting this error when I tried to print the clusters out:

>>> movienames,categories,data = moviecluster.readfile('moviedata.txt')
>>> clust = moviecluster.hcluster(data)
>>> moviecluster.printclust(clust,labels=movienames)
  Starman
Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    moviecluster.printclust(clust,labels=movienames)
  File "C:\Python26\moviecluster.py", line 101, in printclust
    if clust.right!=None: printclust(clust.right,labels=labels,n=n+1)
  File "C:\Python26\moviecluster.py", line 91, in printclust
    if clust.id<0:
AttributeError: 'list' object has no attribute 'id'
>>>

 

Portfolio Assignment 5 March 15, 2009

Filed under: Data Mining — Lauren @ 5:49 pm

Visualizations

Hans Rosling's talk at the TED conference was so cool.  The Baby Name Wizard that Professor Zacharski showed in class caught my attention. I have shown it to all my friends and typed in all of their names, which is fun. As noted in the assignment description, some visualizations are purely artistic, like this Antarctic Animation, which looks cool, but I have no idea what data it is trying to analyze!

I found an interesting baseball visualization that analyzes spending and performance. Towards the end of the 2008 season, about a month before the World Series, the number-one-ranked team, the Angels, was spending a fair amount for their performance. However, the Rays, who were ranked third behind the Angels, were spending considerably less, which counters the idea that the best teams spend the most. The good ole Washington Nats did terribly last season :( but also did not spend nearly as much as the Angels or other expensive teams like Boston and New York.

There are all kinds of visualizations listed on this blog by Meryl K. Evans.

I created a visualization on Many Eyes by uploading a dataset for the price of gas. It only goes to 2004 so I’d like to find a more recent version. This is my practice visualization: Price of Gas 1976-2004.

As a member of the UMW Women's Soccer team, I am particularly proud of this visualization. I took all the team articles from the athletics website and combined them into one large document. I also added every member of the 2008 team. I uploaded it to Many Eyes and got a cool advertisement and summary of the 2008 season! UMW Women's Soccer 2008 Review. It's a work in progress; I plan on editing the dataset to remove words that refer to other teams or anything else that would take away from the season and the team in general. Go Eagles!

 

Portfolio Assignment 4 February 17, 2009

Filed under: Data Mining — Lauren @ 12:13 am

PCI Chapter 3

Installation déjà vu. Thank goodness my Dad loves Python! I downloaded feedparser and found myself in the same pickle as with pydelicious. I called up my Dad, and he said I had to add Python to the Windows path so it can recognize the 'python' command. Just in case anyone else has been having this problem, here is a helpful site. I added ;C:\Python;C:\Python\Scripts to my Path variable.

While trying to simply set up the dataset to cluster, I got a lot of errors even copying the code word-for-word from the book. I looked up the unofficial errata for the book, and there seem to be many problems with page 32! With help from Professor Zacharski, I got the code working, and it produced a nice output file, blogdata.txt.
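
The dataset-generation step is basically feedparser plus word counting; a stripped-down sketch (not the book's exact generatefeedvector.py) looks like this:

import feedparser
import re

def getwordcounts(url):
    # Parse the feed and count how often each word appears across all entries
    d = feedparser.parse(url)
    wc = {}
    for e in d.entries:
        # Some feeds use 'summary', others use 'description'
        summary = e.summary if 'summary' in e else e.description
        text = e.title + ' ' + summary
        for word in re.compile(r'[^A-Za-z]+').split(text):
            word = word.lower()
            if word != '':
                wc[word] = wc.get(word, 0) + 1
    return d.feed.title, wc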

The hierarchical clustering code was a little tough to follow because of all the syntax, but I understand the overall purpose of hcluster. The printclust method is a neat way to output the results (there's a rough sketch of it after the list below). I found a search-engine cluster that includes:

  • John Battelle’s Searchblog
  • The Official Google Blog
  • Search Engine Watch Blog
  • Google Operating System
  • Search Engine Roundtable
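
Here is roughly what printclust does, reconstructed from memory (so details may differ slightly from the book): it walks the cluster tree recursively, indenting by depth and printing either a dash for a merged branch or the blog name for a leaf.

def printclust(clust, labels=None, n=0):
    # Indent to show the depth in the tree
    print ' ' * n,
    if clust.id < 0:
        # Negative ids mark branch nodes created by merging two clusters
        print '-'
    else:
        # Positive ids are original rows; print the blog name if we have one
        if labels == None: print clust.id
        else: print labels[clust.id]
    # Recurse into the left and right branches
    if clust.left != None: printclust(clust.left, labels=labels, n=n+1)
    if clust.right != None: printclust(clust.right, labels=labels, n=n+1)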

Downloading and installing PIL was surprisingly easy! I got the test.jpg file from Appendix A to work! I figure understanding the code inside the PIL library is not essential for this class (more so for a graphics class), so I just got the methods working for the clusters, and they output a pretty JPEG file of the dendrogram for the blogs.

Dendrogram generated by clusters.drawdendrogram

Switching the columns and rows in blogdata.txt and creating a new dendrogram of word clusters took a long time because there are many more words than blogs. While waiting for the JPEG to render, I was sure the picture would be huge, and it is: viewing the image at 3% zoom just barely fits it in the window. I'm not going to upload the whole picture, but here is an interesting word cluster I cropped out:

A cluster from the dendrogram showing word clusters

I like the idea of k-means clustering better than hierarchical clustering because you can define the number of clusters beforehand. This is more practical in the real world because it lets you form groups around the results you expect. I played around with changing the number of centroids: first k=10, like the book, then k=14 and k=4. At first I thought the clusters would each contain an equal number of entries, but that wasn't the case:

>>> [blognames[r] for r in kclust[0]]
["John Battelle's Searchblog", 'Giga Omni Media, Inc.', 'Google Operating System', 'Gawker: Valleywag', 'Gizmodo', 'Lifehacker', 'Slashdot', 'Search Engine Watch Blog', 'Schneier on Security', 'Search Engine Roundtable', 'TechCrunch', 'mezzoblue', 'Matt Cutts: Gadgets, Google, and SEO', 'The Official Google Blog', 'Bloglines | News', 'Quick Online Tips']
>>> [blognames[r] for r in kclust[3]]
['Joystiq', 'Download Squad', 'Engadget', 'Crooks and Liars', "SpikedHumor – Today's Videos and Pictures", 'The Unofficial Apple Weblog (TUAW)', 'Wired Top Stories']
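
For reference, kclust above came from a call shaped like this (assuming the book's kcluster signature; the k value is whichever one I was trying at the time):

# k-means clustering of the blog word-count matrix; kclust is a list of
# k lists, each holding the row indices assigned to that centroid.
kclust = clusters.kcluster(data, k=10)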

I'm not familiar with most of the blogs, but cluster 0 contains a lot of search-engine blogs, similar to the hierarchical clustering. Below is a cluster from the Zebo dataset that I found interesting: it combines the desire for human interaction with the desire for material wealth.

Clusters of things that people want

Of all the data mining techniques we have studied so far, I think clustering is the most useful tool in the real world. There are a lot of datasets that could be clustered to find useful information for marketing as well as for fun.

     

Portfolio Assignment 3 February 10, 2009

Filed under: Data Mining — Lauren @ 1:42 am

I worked on this assignment with my team: Andrew Nelson, Kurt Koller and Will Boyd.

The first thing we did was download the Last.fm API. Kurt signed up on Last.fm to register to use the API. We decided to use Python for the project, so we downloaded and installed pylast from Google Code. Copying the files into the Python directory and then running import pylast in IDLE gave us no errors, so the install was successful.

To begin our recommendation system, we started simple by getting artists similar to a given artist:

>>> artist = pylast.Artist("The Black Keys", "bd46f9bce716e11a6d311d77c06d2159", "a313f7a6a587763c71eeb3cac498ca", '')
>>> artist.get_similar()

and got a large list of similar artists! Success!

Right now we are debating whether to keep Python as a command-line interface or to switch to PHP to make a simple GUI application. Will has the most experience with PHP, so he has taken the lead on getting a simple GUI up and running. Of course, we will now have to download and install the last.fm API for PHP. While Will works on the PHP version, Kurt, Andrew and I are working on the Python command line.

For our system, we'd like to present a menu with a list of options the user can choose:

1. Input artist, output list of similar artists: artist.get_similar()
2. Input artist, output top tracks: artist.get_top_tracks()
3. Input track, output list of similar songs: track.getSimilar()
So, we started on our basic command-line menu system. Should be pretty simple, right, especially with Python? The first issue we ran into was our if-statement: prompt the user for a menu item, then go into that if-block. We created a simple test to see if we could get option 1 from above to work. The user typed in a 1 and it wouldn't go into the corresponding if-block! WHY! We did some research online and our syntax was correct. We finally figured out that raw_input() returns a string and we were comparing it against 1 as an integer! Problem solved, and we got option 1 to work!

Option 2 is not working as smoothly as the first. We keep getting errors from the get_top_tracks() method. Let me just say that the last.fm API documentation is terrible! The syntax is wrong and the parameter values are off. The documentation for get_top_tracks() says it takes two arguments, but when we pass them in, we get an error saying it only takes one parameter.

Code so far:

import pylast

print "Welcome to the Team 3 Pylast Recommender!"
print "To find an artist similar to your artist, press (1)"
print "To find a list of top tracks by an artist, press (2)"
print "To find a list of songs similar to a particular song, press (3)"
print "To quit, press (4)"

input = raw_input(">")
print input

if (input=="1"):
    print "Please enter the name of an Artist"
    similarArtistName = raw_input(">")
    # Pass the name the user typed in (not the literal string "similarArtistName")
    similarArtist = pylast.Artist(similarArtistName, "bd46f9bce716e11a6d311d77c06d2159", "a313f7a6a587763c71eeb3cac498ca40", '')
    print similarArtist.get_similar()
elif (input=="2"):
    print "Please enter the name of an Artist"
    trackArtistName = raw_input(">")
    trackArtist = pylast.Artist(trackArtistName, "bd46f9bce716e11a6d311d77c06d2159", "a313f7a6a587763c71eeb3cac498ca40", '')
    print pylast.Artist.get_top_tracks(trackArtist)

Will demoed a great PHP recommendation system in class. He used these simple last.fm API methods: Artist.get_similar(), Track.get_similar(), Geo.get_top_artist() and Geo.get_top_tracks(). Check it out here!