PCI Chapter 6 – Document Filtering
Next class, my team (Will, Andrew and Kurt) and I will be presenting Chapter 6 in PCI. We have divided the chapter into major sections, and I will be discussing the first three, starting with Filtering Spam.
Why do we need to classify documents based on their contents? To eliminate spam! For my own Gmail account, I use many rule-based, spam-eliminating, organizational methods. But, as the chapter agrees, this isn’t a perfect approach. Because I use word matching, sometimes an email that is destined for the “annoying UMW administration email” folder ends up in my inbox (where I promptly delete it). I also use email address matching to filter my messages, but some addresses could go into more than one folder, depending on the content of the message. So, what to do? How about programs that learn what is and isn’t spam based on what you tell them, and that keep learning not just initially, but as you receive more email?
Documents and Words
Some words appear more frequently in spam, so those words help determine whether a document is spam or not. However, some words that commonly show up in spam could also be important in an email that is not spam. getwords separates the document into words by splitting the text wherever it encounters a character that is not a letter. This means that words with apostrophes are split into separate words. For example, they’re would become two words: they and re.
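To make the splitting behavior concrete, here is a minimal sketch of a getwords-style function. It is my own approximation, not the chapter's exact code: the regular expression and the choice to lowercase everything are assumptions, and this version does no length filtering, so short fragments such as re are kept.

```python
import re

def getwords(doc):
    # Split wherever one or more non-letter characters appear
    # (assumption: "letter" here means the ASCII letters A-Z and a-z).
    fragments = re.split(r"[^A-Za-z]+", doc)
    # Lowercase each fragment and drop empty strings; using a dict
    # means each distinct word is recorded only once per document.
    return {w.lower(): 1 for w in fragments if w}

words = getwords("They're offering cheap pills")
print(sorted(words))  # ['cheap', 'offering', 'pills', 're', 'they']
```

Because the apostrophe is not a letter, the contraction is cut in two, which is why re shows up as its own word.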
Training the Classifier
As we know, the more examples of documents with correct classifications a classifier sees, the better it will become at correctly classifying new documents. After adding the classifier class and its helper methods, I ran the code described on page 121 to check that it was working. Up until now, I really didn’t understand what the class was doing with the features and categories. The example input helped me understand how this classifier will work.
The probability that a word is in a particular category tells us which words are more likely to show up in spam. For example, if the word ‘Viagra’ appears a lot more in the ‘bad’ category than the ‘good’ category, it has a high probability of being a spam word. A word like ‘the’ is probably not a good spam indicator because it is so common in both categories.
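Here is a small sketch of how I picture the counting and the per-category word probability. The class name TinyClassifier, its attribute names, and the five training sentences are all made up for illustration; they are not the chapter's code or sample data. The probability is simply the number of documents in a category that contain the word, divided by the number of documents in that category.

```python
class TinyClassifier:
    def __init__(self):
        self.fc = {}  # word -> {category: number of documents containing it}
        self.cc = {}  # category -> number of documents trained

    def train(self, doc, cat):
        # Count each distinct word once per document.
        for word in set(doc.lower().split()):
            self.fc.setdefault(word, {})
            self.fc[word][cat] = self.fc[word].get(cat, 0) + 1
        self.cc[cat] = self.cc.get(cat, 0) + 1

    def fprob(self, word, cat):
        # Fraction of documents in `cat` that contain `word`.
        docs_in_cat = self.cc.get(cat, 0)
        if docs_in_cat == 0:
            return 0.0
        return self.fc.get(word, {}).get(cat, 0) / docs_in_cat

clf = TinyClassifier()
# Hypothetical training data, invented for this example.
clf.train("the meeting is on monday", "good")
clf.train("see the notes from the lecture", "good")
clf.train("lunch on friday", "good")
clf.train("buy viagra now", "bad")
clf.train("the cheapest viagra online", "bad")

print(clf.fprob("viagra", "bad"))          # 1.0
print(clf.fprob("viagra", "good"))         # 0.0
print(round(clf.fprob("the", "good"), 2))  # 0.67
print(clf.fprob("the", "bad"))             # 0.5
```

With this toy data, viagra separates the two categories cleanly, while the shows up often in both and so says little either way.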
Because the classifier has been trained on only five documents, there are many words that appear in only one document. So, whichever category that document was assigned, the word’s probability for the other category will be 0 because it hasn’t appeared there yet. This is not very reasonable, especially for frequent spam-ish words. Weighting the probability starts it at 50% and then adjusts it as more occurrences of the word appear.
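One way to picture the weighting is as a weighted average of an assumed starting probability (50%) and the probability actually observed so far. The sketch below assumes that form; the function name, parameter names, and the default weight of 1 are my own choices for illustration.

```python
def weighted_prob(basic_prob, total_count, weight=1.0, assumed_prob=0.5):
    # basic_prob:   observed probability of the word in one category
    # total_count:  how many times the word has been seen in all categories
    # weight:       how many "virtual" observations the assumed value is worth
    # assumed_prob: the starting guess used before any evidence arrives
    return (weight * assumed_prob + total_count * basic_prob) / (weight + total_count)

# A word never seen before sits exactly at the starting guess.
print(weighted_prob(0.0, 0))  # 0.5
# Seen once, and only in the other category: pulled toward 0, but not all the way.
print(weighted_prob(0.0, 1))  # 0.25
# Seen once, in this category: pulled toward 1.
print(weighted_prob(1.0, 1))  # 0.75
```

As total_count grows, the starting guess matters less and the result approaches the observed probability.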