Lauren's Blog

stop and smell the roses

Portfolio Assignment 9 April 23, 2009

Filed under: Data Mining — Lauren @ 4:42 am

Final Project

For our final project, my team (Andrew, Kurt, Will) and I will be expanding on our work from last week with document filtering.

The Problem

As you well know, spam is a very annoying and persistent presence on the Internet. In chapter 6 of PCI, we learned that rule-based classifiers don’t cut it because spammers are getting smarter. So, we created a learning classifier that is trained on data and assigns a document a category based on word or feature probabilities. The only guideline for our project was to use a substantial dataset. The algorithm in the book uses strings as “documents”; we want to use real email documents to train the classifier and use it for future classifications.

The Data

At first, we searched the Internet for some fun spam datasets to download. Of course, there were a ton! But the way we planned to modify the classifying algorithm in the book was to use email text, and we kept finding weird formats for the datasets. So, Will logged into his old Yahoo email account and found 1,400 spam emails. I’m pretty sure if I logged into my old AOL account I would find a similar number! At first we thought we were going to have to use the sampletrain method from the book and type the name of every file into a line of code. That would take forever and wouldn’t be very realistic in real life. Will renamed all of his emails into the format spam#.txt or nonspam#.txt, and whipped up a function that reads a listing file and trains the classifier on each email in it:

def openfiles(cl):
    # each line of the listing file is expected to look like:
    # <index> <filename> <label>, where a label of "1" marks spam
    data = open('blogsplogreal.txt', 'r')
    lines = data.readlines()
    for i in lines:
        thisline = i.split(" ")
        filename = thisline[1]
        print 'opening: ' + filename
        if thisline[2] == "1\n":
            spamtype = 'spam'
        else:
            spamtype = 'not-spam'
        print 'file type: ' + spamtype
        # train the classifier on this file under its category
        cl.train(filename, spamtype)
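We never pasted blogsplogreal.txt itself into the post, but from the way openfiles parses it, each line would look something like this (the first field is an index the function ignores, and these exact entries are made up for illustration):

    0 spam1.txt 1
    1 nonspam1.txt 0
    2 spam2.txt 1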

This was useful because we created a loop to train the classifier by concatenating the basename of the file (spam or nonspam), the number, and ‘.txt’:

def sampletrain(cl, basefile, numfiles, gory):
    # gory is the category label to train under ('spam' or 'not-spam')
    for i in range(1, numfiles + 1):
        filename = basefile + str(i) + '.txt'
        #print filename
        cl.train(filename, gory)
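With that in place, training on all of the renamed files is just a couple of calls, using the book’s classifier class. The file counts here are made-up placeholders, not our actual split:

    cl = classifier(getwords)
    sampletrain(cl, 'spam', 700, 'spam')
    sampletrain(cl, 'nonspam', 700, 'not-spam')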

The Solution

We used Bayes (when in doubt, use Bayes!) to train and classify documents. Starting from the book’s code, we had to edit the getwords method to open a file and add the words to the dictionary. A friendly neighbor in the lab showed us how to do file I/O, and this is what we came up with:

def getwords(doc):
    # doc is now a filename rather than the document text itself
    data = open(doc, 'r')
    lines = ''
    for line in data:
        lines += line
    #print lines
    splitter = re.compile('\\W*')
    # Split the words by non-alpha characters
    words = [s.lower() for s in splitter.split(lines) if len(s) > 2 and len(s) < 20]
    #print words
    # Return the unique set of words only
    return dict([(w, 1) for w in words])

This opens a file and concatenates each line into one big string. Then the string is split up and converted to lowercase, as the book does. Now we can send the classifier a filename and it will get the features and proceed with the same algorithm.
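As a quick sanity check, this is the kind of dictionary it returns (the filename and contents here are hypothetical):

    >>> # suppose spam1.txt contains the single line "Buy cheap viagra now"
    >>> getwords('spam1.txt')
    {'buy': 1, 'cheap': 1, 'viagra': 1, 'now': 1}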

The Results

We were very excited to see that our modifications to allow files to be trained ran without errors! It took some serious looking at the code to make sure we were doing it correctly, but now we really understand what is going on. Starting small, we trained two documents, one that was spam and one that was not. It correctly added the categories and features to the dictionary: success number 1! Then we trained a few more documents and gave it an unknown document to classify, and it worked! It classified 100% of 4 documents correctly. Now that we knew it worked, we ran the algorithm with Will’s mixed spam and nonspam files. Tomorrow we’re going to run it with a combination of unknown documents and see how it classifies in front of the class.
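For the curious, classifying a new file uses the book’s naivebayes class unchanged; it looks roughly like this (unknown1.txt is a stand-in name, not one of our real files):

    >>> cl = naivebayes(getwords)
    >>> openfiles(cl)
    >>> cl.classify('unknown1.txt', default='unknown')
    'spam'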

Portfolio Assignment 8 April 11, 2009

Filed under: Data Mining — Lauren @ 8:53 pm

PCI Chapter 6 – Document Filtering

Next class, my team (Will, Andrew and Kurt) and I will be presenting Chapter 6 in PCI.  We have divided the chapter into major sections, and I will be discussing the first three, starting with Filtering Spam.

Filtering Spam

Why do we need to classify documents based on their contents? To eliminate spam! For my own Gmail account, I use many rule-based, spam-eliminating, organizational methods. But, as the chapter agrees, this isn’t a perfect approach. Because I use word matching, sometimes an email that is destined for the “annoying UMW administration email” folder ends up in my inbox (and I promptly delete it). I also use email address matching to filter my messages, but some addresses could go into more than one folder, depending on the content of the message. So, what to do? How about programs that learn based on what you tell them is and isn’t spam, and that keep learning not just initially, but as you receive more email.

Documents and Words

Some words appear more frequently in spam, and those words help determine whether a document is spam or not. There are also words that commonly show up in spam but could be important in an email that is not spam. getwords separates the document into words by splitting the stream whenever it encounters a character that is not a letter. This means that words with apostrophes are separated into separate words. For example, they’re would become two words: they and re (and re would then be thrown out by the length filter).
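You can watch the split happen right in the interpreter:

    >>> import re
    >>> re.compile('\\W*').split("they're")
    ['they', 're']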

Training the Classifier

As we know, the more examples of documents with correct classifications a classifier sees, the better it will become at correctly classifying new documents. After adding the classifier class and its helper methods, I ran the code described on page 121 to check that it was working. Up until now, I really didn’t understand what the class was doing with the features and categories. The example input helped me understand how this classifier will work.
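For anyone following along without the book handy, the page 121 check goes roughly like this (reconstructed from memory, so the exact strings may differ slightly):

    >>> import docclass
    >>> cl = docclass.classifier(docclass.getwords)
    >>> cl.train('the quick brown fox jumps over the lazy dog', 'good')
    >>> cl.train('make quick money in the online casino', 'bad')
    >>> cl.fcount('quick', 'good')
    1.0
    >>> cl.fcount('quick', 'bad')
    1.0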

Calculating Probabilities

The probability that a word appears in a particular category tells us which words are good spam indicators. For example, if the word ‘Viagra’ appears a lot more often in the ‘bad’ category than the ‘good’ category, it has a high probability of being a spam word. A word like ‘the’ is probably not a good spam indicator because it is so common:

>>> cl.fprob('the', 'good')
1.0
>>> cl.fprob('the', 'bad')
0.5
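fprob itself is just a conditional probability, Pr(word | category): the number of times the feature appeared in a category, divided by the total number of documents in that category. Roughly as the book defines it:

    def fprob(self, f, cat):
        if self.catcount(cat) == 0: return 0
        # times the feature appeared in this category,
        # divided by the number of documents in the category
        return self.fcount(f, cat) / self.catcount(cat)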

Because the classifier has only been trained on five documents, there are many words that appear in only one document. So whichever category such a word lands in, its probability for the other category will be 0 because it hasn’t appeared there yet. This is not very reasonable, especially for spam-ish words that just happen to show up in a legitimate email first. Weighting a probability starts the probability at an assumed 50% and then moves it toward the observed value as more occurrences of the word appear.
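The book’s weightedprob method does exactly this, blending an assumed probability (ap, defaulting to 0.5) with the observed one according to how much evidence there is, roughly as it appears in the chapter:

    def weightedprob(self, f, cat, prf, weight=1.0, ap=0.5):
        # the current probability from the basic calculation (e.g. fprob)
        basicprob = prf(f, cat)
        # total count of this feature across all categories
        totals = sum([self.fcount(f, c) for c in self.categories()])
        # weighted average of the assumed and observed probabilities
        return ((weight * ap) + (totals * basicprob)) / (weight + totals)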