For our final project, my team (Andrew, Kurt, Will) and I will be expanding on our work from last week with document filtering.
As you well know, spam is a very annoying and persistent presence on the Internet. In chapter 6 of PCI, we learned that rule-based classifiers don't cut it because spammers are getting smarter. So, we created a learning classifier that is trained on data and assigns a document a category based on word (feature) probabilities. The only guideline for our project was to use a substantial dataset. The algorithm in the book uses short strings as "documents". We want to use real email documents to train the classifier and use it for future classifications.
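If you haven't read chapter 6, here's roughly what that looks like in code. This is our own simplified Python sketch of the idea, not the book's actual docclass code, and all the names below are made up:

```python
from collections import defaultdict

class NaiveBayes:
    """A stripped-down version of the chapter 6 idea: count how often
    each word shows up per category, then pick the category with the
    highest combined probability."""

    def __init__(self):
        self.fc = defaultdict(int)  # (word, category) -> count
        self.cc = defaultdict(int)  # category -> number of documents

    def train(self, words, cat):
        self.cc[cat] += 1
        for w in set(words):
            self.fc[(w, cat)] += 1

    def weightedprob(self, w, cat, weight=1.0, ap=0.5):
        # P(word | category), blended with an assumed prior of 0.5
        # so words seen only once or twice don't dominate
        basic = self.fc.get((w, cat), 0) / self.cc[cat]
        total = sum(self.fc.get((w, c), 0) for c in self.cc)
        return (weight * ap + total * basic) / (weight + total)

    def classify(self, words):
        best, bestp = None, 0.0
        ndocs = sum(self.cc.values())
        for cat in self.cc:
            p = self.cc[cat] / ndocs  # P(category)
            for w in set(words):
                p *= self.weightedprob(w, cat)  # P(word | category)
            if p > bestp:
                best, bestp = cat, p
        return best
```

The real chapter 6 classifier adds classification thresholds and a Fisher variant, but this is the train/classify core we built on.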
At first, we searched the Internet for some fun spam datasets to download. Of course, there were a ton! But the way we planned to modify the book's classifying algorithm was to use raw email text, and the datasets we kept finding were in weird formats. So, Will logged into his old Yahoo email account and found 1,400 spam emails. I'm pretty sure if I logged into my old AOL account I would find a similar number! At first we thought we were going to have to use the sampletrain method from the book and type the name of every file into a line of code. That would take forever and wouldn't be very realistic. Will whipped up a function to rename all of his emails into a format of either spam#.txt or nonspam#.txt:
data = open('blogsplogreal.txt', 'r')
lines = data.readlines()
for i in lines:
    # Each line looks like "filename 1" (spam) or "filename 0" (not spam)
    thisline = i.split(" ")
    filename = thisline[0]
    print 'opening: ' + filename
    if thisline[1] == "1\n":
        spamtype = 'spam'
    else:
        spamtype = 'not-spam'
    print 'file type: ' + spamtype
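We don't have Will's exact rename step to paste here, but combined with os.rename the whole thing would look something like this. The manifest format with a trailing 0/1 label and the per-category counters are our reconstruction:

```python
import os

def rename_emails(manifest_path):
    """Rename the files listed in a manifest into spam#.txt /
    nonspam#.txt, based on a '1' (spam) or '0' (not spam) label
    at the end of each line."""
    counts = {'spam': 0, 'nonspam': 0}
    with open(manifest_path) as manifest:
        for line in manifest:
            filename, label = line.split()
            kind = 'spam' if label == '1' else 'nonspam'
            counts[kind] += 1
            os.rename(filename, '%s%d.txt' % (kind, counts[kind]))
    return counts
```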
This was useful because we created a loop to train the classifier by concatenating the basename of the file (spam or nonspam), the number, and '.txt':
for i in range(1, numfiles + 1):
    filename = basefile + str(i) + '.txt'
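In other words, the files the loop feeds the trainer are fully determined by the base name and a count, which makes the scheme easy to sanity-check (training_filenames is just our name for a quick helper):

```python
def training_filenames(basefile, numfiles):
    """Build the list of files the training loop will open,
    e.g. spam1.txt through spamN.txt."""
    return [basefile + str(i) + '.txt' for i in range(1, numfiles + 1)]
```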
We used Bayes (when in doubt, use Bayes!) to train and classify documents. Starting from the book's code, we had to edit the getwords method to open a file and add its words to the dictionary. A friendly neighbor in the lab showed us how to do file I/O, and this is what we came up with:
import re

def getwords(doc):
    data = open(doc, 'r')
    lines = ''
    for line in data:
        lines += line
    # Split the words by non-alpha characters
    splitter = re.compile('\\W*')
    words = [s.lower() for s in splitter.split(lines)
             if len(s) > 2 and len(s) < 20]
    # Return the unique set of words only
    return dict([(w, 1) for w in words])
This opens a file and concatenates each line into one big string. Then the string is split up and converted to lowercase, as in the book. Now we can send the classifier a filename and it will extract the features and proceed with the same algorithm.
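The book's code is Python 2, so for anyone following along in Python 3, the same logic looks like this (getwords_from_file is our name for it; the behavior is the same: unique lowercase words between 3 and 19 characters):

```python
import re

def getwords_from_file(path):
    """Read a whole file and return its unique words, split on
    non-alphanumeric characters and lowercased, keeping only
    words between 3 and 19 characters long."""
    with open(path) as f:
        text = f.read()
    splitter = re.compile(r'\W+')
    return {s.lower(): 1 for s in splitter.split(text)
            if 2 < len(s) < 20}
```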
We were very excited to see that our modifications to train from files ran without errors! It took some serious staring at the code to make sure we were doing it correctly, but now we really understand what is going on. Starting small, we trained on two documents, one spam and one not. It correctly added the categories and features to the dictionary: success number 1! Then we trained a few more documents, gave it an unknown document to classify, and it worked! It classified 4 out of 4 documents correctly. Now that we know it works, we ran the algorithm on Will's mixed spam and nonspam files. Tomorrow we're going to run it on a combination of unknown documents and see how it classifies in front of the class.