Lauren's Blog

stop and smell the roses

Portfolio Assignment 9 April 23, 2009

Filed under: Data Mining — Lauren @ 4:42 am

Final Project

For our final project, my team (Andrew, Kurt, Will) and I will be expanding on our work from last week with document filtering.

The Problem

As you well know, spam is a very annoying and persistent presence on the Internet. In chapter 6 of PCI, we learned that rule-based classifiers don’t cut it because spammers keep getting smarter. So we built a learning classifier that is trained on data and assigns a document to a category based on word (feature) probabilities. The only guideline for our project was to use a substantial dataset. The algorithm in the book uses short strings as “documents”; we want to train the classifier on real email documents and use it for future classifications.
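Roughly, that kind of classifier can be sketched in a few lines. This is an illustrative toy, not our actual code: the class name and methods are made up, but the idea (per-word probabilities learned from training data, combined with Bayes' rule, plus the weighted 0.5 prior the book suggests for unseen words) is the one from the chapter:

```python
# Toy sketch of a learning classifier: categories come from per-word
# probabilities learned from training data. Names are illustrative.
class TinyClassifier:
    def __init__(self):
        self.feature_counts = {}   # {word: {category: count}}
        self.category_counts = {}  # {category: number of documents}

    def train(self, words, category):
        self.category_counts[category] = self.category_counts.get(category, 0) + 1
        for w in set(words):
            self.feature_counts.setdefault(w, {})
            self.feature_counts[w][category] = self.feature_counts[w].get(category, 0) + 1

    def word_prob(self, word, category):
        # P(word | category), blended with a 0.5 prior so a word the
        # classifier has never seen in a category doesn't zero out the product
        total = self.category_counts.get(category, 0)
        count = self.feature_counts.get(word, {}).get(category, 0)
        basic = count / total if total else 0.0
        seen = sum(self.feature_counts.get(word, {}).values())
        return (0.5 + seen * basic) / (1.0 + seen)

    def classify(self, words):
        best, best_score = None, -1.0
        total_docs = sum(self.category_counts.values())
        for cat in self.category_counts:
            score = self.category_counts[cat] / total_docs   # P(category)
            for w in set(words):
                score *= self.word_prob(w, cat)              # P(word | category)
            if score > best_score:
                best, best_score = cat, score
        return best

cl = TinyClassifier()
cl.train('buy cheap viagra now'.split(), 'spam')
cl.train('meeting notes for the project'.split(), 'not-spam')
print(cl.classify('cheap viagra'.split()))  # → spam
```

The real version works on files instead of word lists, which is exactly the modification described below.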

The Data

At first, we searched the Internet for some fun spam datasets to download. Of course, there were a ton! But we planned to modify the book’s classifying algorithm to work on email text, and the datasets we found were all in weird formats. So Will logged into his old Yahoo email account and found 1,400 spam emails. I’m pretty sure if I logged into my old AOL account I would find a similar number! At first we thought we would have to use the sampletrain method from the book and type the name of every file into a line of code. That would take forever and make the approach unrealistic for real use. Instead, Will whipped up a script to rename all of his emails to either spam#.txt or nonspam#.txt, plus a function that reads a listing file and trains the classifier on each one:

def openfiles(cl):
    data = open('blogsplogreal.txt', 'r')
    lines = data.readlines()
    for i in lines:
        thisline = i.split(' ')
        filename = thisline[1]
        print 'opening: ' + filename
        if thisline[2] == '1\n':
            spamtype = 'spam'
        else:
            spamtype = 'not-spam'
        print 'file type: ' + spamtype
        cl.train(filename, spamtype)

This was useful because we could then train the classifier with a loop that builds each filename by concatenating the basename of the file (spam or nonspam), the file number, and ‘.txt’:

def sampletrain(cl, basefile, numfiles, category):
    for i in range(1, numfiles + 1):
        filename = basefile + str(i) + '.txt'
        cl.train(filename, category)


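To make that concrete, here is a self-contained sketch of how the loop gets called. The stub classifier below just records (filename, category) pairs in place of the real one, and the file counts are made up, so the example runs without Will’s actual emails:

```python
# Stub classifier: records (filename, category) pairs instead of training,
# so we can demonstrate the sampletrain loop without any real email files.
class StubClassifier:
    def __init__(self):
        self.trained = []
    def train(self, filename, category):
        self.trained.append((filename, category))

def sampletrain(cl, basefile, numfiles, category):
    for i in range(1, numfiles + 1):
        cl.train(basefile + str(i) + '.txt', category)

cl = StubClassifier()
sampletrain(cl, 'spam', 2, 'spam')          # spam1.txt, spam2.txt
sampletrain(cl, 'nonspam', 2, 'not-spam')   # nonspam1.txt, nonspam2.txt
print(cl.trained)
# → [('spam1.txt', 'spam'), ('spam2.txt', 'spam'),
#    ('nonspam1.txt', 'not-spam'), ('nonspam2.txt', 'not-spam')]
```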
The Solution

We used Bayes (when in doubt, use Bayes!) to train on and classify documents. Starting from the book’s code, we had to edit the getwords method to open a file and add its words to the dictionary. A friendly neighbor in the lab showed us how to do file I/O, and this is what we came up with:


def getwords(doc):
    data = open(doc, 'r')
    lines = ' '
    for line in data:
        lines += line
    # Split the words by non-alpha characters
    splitter = re.compile('\\W*')
    words = [s.lower() for s in splitter.split(lines) if len(s) > 2 and len(s) < 20]
    # Return the unique set of words only
    return dict([(w, 1) for w in words])

This opens a file and concatenates each line into one big string. The string is then split up and converted to lowercase, as in the book. Now we can send the classifier a filename and it will extract the features and proceed with the same algorithm.
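Here is a quick self-contained check of that flow: write a throwaway file, extract its features, and look at the resulting dictionary. One caveat if you try this on a recent Python: this sketch uses ‘\\W+’ instead of the book’s ‘\\W*’, because from Python 3.7 on re.split also splits on empty matches, which would shred the text into single characters:

```python
# Demo of the getwords flow on a throwaway file created with tempfile.
import re
import tempfile

def getwords(doc):
    lines = ' '
    for line in open(doc, 'r'):
        lines += line                     # concatenate lines into one string
    splitter = re.compile('\\W+')         # split on runs of non-word characters
    words = [s.lower() for s in splitter.split(lines) if len(s) > 2 and len(s) < 20]
    return dict((w, 1) for w in words)    # unique words only

tmp = tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False)
tmp.write('Buy CHEAP pills\nact now!!!')
tmp.close()

print(getwords(tmp.name))
# → {'buy': 1, 'cheap': 1, 'pills': 1, 'act': 1, 'now': 1}
```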

The Results

We were very excited to see that our modifications to train from files ran! It took some serious staring at the code to make sure we were doing it correctly, but now we really understand what is going on. Starting small, we trained on two documents, one spam and one not. It correctly added the categories and features to the dictionary: success number 1! Then we trained on a few more documents, gave it an unknown document to classify, and it worked! It classified 4 out of 4 documents correctly. Now that we know it works, we ran the algorithm on Will’s mixed spam and nonspam files. Tomorrow we’ll run it on a set of unknown documents and see how it classifies them in front of the class.

