Lauren's Blog

stop and smell the roses

Disable Balloon Tips July 29, 2009

Filed under: Windows XP — Lauren @ 3:51 pm

I hate clutter in my System Tray. I hate when software runs when Windows starts and adds another icon. I especially hate the annoying information balloons that pop up all the time! Why do I need a balloon notifying me that “a network cable has been disconnected” when the icon changes to a computer with a big, red X on it? Or the “Found new hardware” balloon every time I plug in my flash drive? I would be OK with them if they eventually went away like the Outlook Desktop Alert notifications, but you have to click the balloons to make them disappear! Too much clicking!

I’m all about customization, so I wanted to get rid of them ASAP. I checked all the Windows Display Settings and Taskbar Settings and didn’t find anything for balloon tips, so I Googled. Here are my Google search results: “get rid of windows balloons”.

The first two hits gave me what I wanted. Since this solution involves editing the Windows Registry, I checked a couple more articles to make sure the solutions were similar. I don’t like to mess with the operating system but I had to get rid of those balloons! The PCMAG article is the one I followed initially:

Get Rid of Those Pesky Balloons!

  1. From the Start button select Run (Windows Logo + R)
  2. Type regedit and hit Enter to open the Registry Editor
  3. Go to HKEY_CURRENT_USER → Software → Microsoft → Windows → CurrentVersion → Explorer → Advanced
  4. Under Edit select New → DWORD Value
  5. Type EnableBalloonTips and hit Enter
  6. Close the Registry Editor and Log Out/Log In again to enable the change

At first I was confused about why I was typing EnableBalloonTips when I wanted to disable them, but reading the WindowsNetworking article I learned that setting the value to 0 disables them (a newly created DWORD value defaults to 0, which is why the steps above work as written). To enable the balloon tips again, set the value to 1.
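Since I’ve been living in Python for class anyway, here is a little sketch of making the same change in code with the _winreg module from the standard library. This assumes the Advanced key already exists (it normally does), and you still need to log out and back in afterward:

import _winreg

# Open HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\Advanced for writing
key = _winreg.OpenKey(_winreg.HKEY_CURRENT_USER,
                      r'Software\Microsoft\Windows\CurrentVersion\Explorer\Advanced',
                      0, _winreg.KEY_SET_VALUE)
# 0 disables balloon tips, 1 re-enables them
_winreg.SetValueEx(key, 'EnableBalloonTips', 0, _winreg.REG_DWORD, 0)
_winreg.CloseKey(key)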

After I logged back in, I tested it out by unplugging my Ethernet cable. No balloon! Thanks PCMag :)

Portfolio Assignment 9 April 23, 2009

Filed under: Data Mining — Lauren @ 4:42 am

Final Project

For our final project, my team (Andrew, Kurt, Will) and I will be expanding on our work from last week with document filtering.

The Problem

As you well know, spam is a very annoying and persistent presence on the Internet. In chapter 6 of PCI, we learned that rule-based classifiers don’t cut it because spammers are getting smarter. So, we created a learning classifier that is trained on data and assigns a document a category depending on word or feature probabilities. The only guideline for our project was to use a substantial dataset. The algorithm in the book uses strings as “documents”. We want to use real email documents to train the classifier and use it for future classifications.

The Data

At first, we searched the Internet for some fun spam datasets to download. Of course, there were a ton! But we planned to modify the classifying algorithm in the book to use email text, and we kept finding weird formats for the datasets. So, Will logged into his old Yahoo email account and found 1,400 spam emails. I’m pretty sure if I logged into my old AOL account I would find a similar number! At first we thought we were going to have to use the sampletrain method from the book and type the name of every file into a line of code. That would take forever and wouldn’t be very realistic. Will whipped up a function to rename all of his emails into the format spam#.txt or nonspam#.txt, plus this openfiles function, which reads a listing file and trains the classifier on each file it names:

def openfiles(cl):
    # Each line of the listing file has a filename in its second field
    # and a 1/0 spam flag in its third
    data = open('blogsplogreal.txt', 'r')
    lines = data.readlines()
    for i in lines:
        thisline = i.split(" ")
        filename = thisline[1]
        print 'opening: ' + filename
        if thisline[2] == "1\n":
            spamtype = 'spam'
        else:
            spamtype = 'not-spam'
        print 'file type: ' + spamtype
        cl.train(filename, spamtype)

The renaming was useful because we could then write a loop that trains the classifier by concatenating the basename of the file (spam or nonspam), the number, and ‘.txt’:

def sampletrain(cl, basefile, numfiles, gory):
    # Train on basefile1.txt ... basefileN.txt, all labelled with the same category
    for i in range(1, numfiles + 1):
        filename = basefile + str(i) + '.txt'
        #print filename
        cl.train(filename, gory)
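With a classifier object cl in hand (we use the book’s naive Bayes classifier, described below), the whole training run boils down to two calls. The 500 non-spam count here is made up for illustration; we only counted the 1,400 spam messages:

sampletrain(cl, 'spam', 1400, 'spam')        # trains on spam1.txt ... spam1400.txt
sampletrain(cl, 'nonspam', 500, 'not-spam')  # trains on nonspam1.txt ... nonspam500.txt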

The Solution

We used Bayes (when in doubt, use Bayes!) to train and classify documents. Starting from the book’s code, we had to edit the getwords method to open a file and add its words to the dictionary. A friendly neighbor in the lab showed us how to do file I/O, and this is what we came up with:

 

import re

def getwords(doc):
    # Read the whole file into one big string
    data = open(doc, 'r')
    lines = ''
    for line in data:
        lines += line
    #print lines
    # Split the words by non-alpha characters
    splitter = re.compile('\\W*')
    words = [s.lower() for s in splitter.split(lines) if len(s) > 2 and len(s) < 20]
    #print words
    # Return the unique set of words only
    return dict([(w, 1) for w in words])

This opens a file and concatenates each line into one big string. Then it is split up and converted to lowercase as the book does. Now, we can send the classifier a filename and it will get the features and resume with the same algorithm.
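With the classifier trained as above, classifying a new message is just a matter of handing classify a filename (classify and its default argument come from the book’s naivebayes class; ‘unknown1.txt’ is a made-up name):

# Returns 'spam' or 'not-spam', or 'unknown' if no category is confident enough
print(cl.classify('unknown1.txt', default='unknown'))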

The Results

We were very excited to see that our modifications to allow training from files actually ran! It took some serious looking at the code to make sure we were doing it correctly, but now we really understand what is going on. Starting small, we trained two documents, one that was spam and one that was not. It correctly added the categories and features to the dictionary: success number 1! Then we trained a few more documents and gave it an unknown document to classify, and it worked! It classified all 4 documents correctly. Once we knew it worked, we ran the algorithm with Will’s mixed spam and nonspam files. Tomorrow we’re going to run it with a combination of unknown documents and see how it classifies in front of the class.

 

Portfolio Assignment 8 April 11, 2009

Filed under: Data Mining — Lauren @ 8:53 pm

PCI Chapter 6 – Document Filtering

Next class, my team (Will, Andrew and Kurt) and I will be presenting Chapter 6 in PCI.  We have divided the chapter into major sections, and I will be discussing the first three, starting with Filtering Spam.

Filtering Spam

Why do we need to classify documents based on their contents? To eliminate spam! For my own Gmail account, I use many rule-based, spam-eliminating, organizational methods. But, as the chapter agrees, this isn’t a perfect approach. Because I use word matching, sometimes an email that is destined for the “annoying UMW administration email” folder ends up in my inbox (where I promptly delete it). I also use email address matching to filter my messages, but some addresses could go into more than one folder, depending on the content of the message. So, what to do? How about a program that learns from what you tell it is and isn’t spam, and keeps doing so not just initially, but as you receive more email?

Documents and Words

Some words appear more frequently in spam, and those words help determine whether a document is spam or not. There are also words that commonly show up in spam but could be important in an email that is not spam. getwords separates the document into words by splitting the stream whenever it encounters a character that is not a letter. This means that words with apostrophes are split into separate words. For example, they’re would become two words: they and re.
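A quick illustration of that splitting behavior (this uses \W+ rather than the book’s \W* pattern, but the effect on the words is the same):

import re

# Split on runs of non-alphanumeric characters, like the book's getwords splitter
splitter = re.compile(r'\W+')
print(splitter.split("They're offering free stuff"))
# ['They', 're', 'offering', 'free', 'stuff']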

Training the Classifier

As we know, the more examples of documents with correct classifications a classifier sees, the better it will become at correctly classifying new documents. After adding the classifier class and its helper methods, I ran the code described on page 121 to check that it was working. Until now, I really didn’t understand what the class was doing with the features and categories. The example input helped me understand how this classifier will work.
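From memory, that check looked roughly like this (not the book’s exact text, but the same idea: train a couple of tiny “documents” and peek at the feature counts):

>>> cl = docclass.classifier(docclass.getwords)
>>> cl.train('the quick brown fox jumps over the lazy dog', 'good')
>>> cl.train('make quick money in the online casino', 'bad')
>>> cl.fcount('quick', 'good')
1.0
>>> cl.fcount('quick', 'bad')
1.0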

Calculating Probabilities

The probability that a word is in a particular category will make certain words more likely to show up in spam. For example, if the word ‘Viagra’ appears a lot more in the ‘bad’ category than the ‘good’ category, it has a high probability of being a spam word. A word like ‘the’ is probably not a good spam indicator because it is so common:

>>> cl.fprob('the', 'good')
1.0
>>> cl.fprob('the', 'bad')
0.5

Because only five documents have been trained by the classifier, there are many words that appear in only one document. So whichever category such a word is assigned, its probability for the other category will be 0 because it hasn’t appeared there yet. This is not very reasonable, especially for frequent spam-ish words. Weighting the probability starts it at 50% (an assumed probability) and then lets it change as more occurrences of the word appear.
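The weighted probability from the chapter boils down to a formula like this (a sketch of the idea, not the book’s exact code):

def weightedprob(basicprob, totalcount, weight=1.0, assumedprob=0.5):
    # Blend the assumed 50% probability with the observed probability,
    # weighted by how many times the feature has actually been seen
    return (weight * assumedprob + totalcount * basicprob) / (weight + totalcount)

# A word seen once, only in 'bad' documents: instead of P(word | good) = 0.0,
# we get 0.25, which only drifts toward 0 as the evidence piles up
print(weightedprob(0.0, 1))   # 0.25
print(weightedprob(0.0, 4))   # 0.1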

 

Portfolio Assignment 7 March 24, 2009

Filed under: Data Mining — Lauren @ 6:02 pm

PCI Chapter 4 – Searching and Ranking

Everyone knows the most popular search engine in the world is Google, with the help of the PageRank algorithm. Chapter 4 creates a search engine by collecting documents by crawling, indexing the locations of different words, and finally ranking pages to return to a user as a list. Google and other search engines are so fast, it’s hard to think of all the work that goes into a query. Until now, I’ve never really thought about what happens behind the scenes; I just Google it!

A Simple Crawler

The built-in page-downloading library, urllib2, was easy to see in action: it downloads an HTML page, and you can print out characters from anywhere in the page, at different locations and ranges. Combined with BeautifulSoup, it can parse HTML and XML documents even when they are poorly written. I had to download and install BeautifulSoup, and to do that I had to download WinZip so I could extract the tar.gz file that BeautifulSoup provides for installation. I put the BeautifulSoup.py file in my Python directory and got it working.
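Here is a small sketch along the lines of what the chapter does with these two libraries (BeautifulSoup 3 style imports, and the UMW homepage as a stand-in URL):

import urllib2
from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3, from the tar.gz install

# Download a page and print a slice of its raw HTML
c = urllib2.urlopen('http://www.umw.edu')
contents = c.read()
print(contents[0:80])

# Hand the (possibly messy) HTML to BeautifulSoup and pull out the links
soup = BeautifulSoup(contents)
for link in soup('a'):
    if 'href' in dict(link.attrs):
        print(link['href'])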

The idea of “crawling” through the Internet completely baffles me. The Internet is such a huge and ever-expanding entity that sometimes I think Google is magic. However, after adding the crawl method to searchengine.py, I have a greater understanding of how it is done. It is a pretty clever algorithm! Instead of testing the crawl method on kiwitobes, I used the real Wikipedia entry for Perl:

>>> pagelist = ['http://en.wikipedia.org/wiki/Perl']
>>> crawler = searchengine.crawler('')
>>> crawler.crawl(pagelist)
Could not open http://en.wikipedia.org/wiki/Perl

Hmm, OK let’s try something different, like the UMW homepage. Now we’re talking!

>>> pagelist = ['http://www.umw.edu']
>>> crawler = searchengine.crawler('')
>>> crawler.crawl(pagelist)
Indexing http://www.umw.edu
Indexing http://www.umw.edu/about/administration
Indexing http://www.umw.edu/featuredfaculty/stull
Indexing http://www.umw.edu/news
Indexing http://www.umw.edu/news/?a=1648
Indexing http://strategicplanning.umwblogs.org
Indexing http://www.umw.edu/academics
Indexing http://www.umw.edu/athletics
Indexing http://www.umw.edu/events
Indexing http://www.umw.edu/about
Indexing http://umwblogs.org
Indexing http://www.umw.edu/azindex

It still ends in an error… coming back to this later.

I installed pysqlite for Python 2.6 and got it working smoothly. I like how you use Python strings to execute SQL queries to manipulate the database! It took me a while to get the database schema set up because there were a lot of typos, and it kept giving me errors like “table urllist already exists”. So I renamed the database and got it working. The crawler class is complete and ready to test, if I can find a website that will work for it!
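If anyone else hits the “table already exists” errors, a sketch of the schema setup with if-not-exists guards looks something like this (table names are from the book’s searchengine schema, trimmed to the first few):

from pysqlite2 import dbapi2 as sqlite   # pysqlite for Python 2.x

con = sqlite.connect('searchindex.db')
# 'if not exists' means re-running the setup won't raise "table urllist already exists"
con.execute('create table if not exists urllist(url)')
con.execute('create table if not exists wordlist(word)')
con.execute('create table if not exists wordlocation(urlid, wordid, location)')
con.commit()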

 

Portfolio Assignment 6

Filed under: Data Mining — Lauren @ 5:57 pm

Clustering Movies

For this assignment, my team (Andrew, Kurt, Will) and I tried to cluster a very large file with movie data. Once we got the text file to work in the readfile method, we ran it on my computer, waited, waited and waited. We knew ahead of time that it would take a while to cluster so we ran it during class (approximately 2.5 hours) and still nothing. 

Not knowing what to do next, I browsed my classmates’ blogs to see how they approached this movie data. The idea I tried next was to get rid of most of the columns, except for two, and narrow the data down to 1,000 movies. I let it run for 5 minutes or so and finally got the Python command prompt back! But then I kept getting this error when I tried to print the clusters out:

>>> movienames,categories,data=moviecluster.readfile('moviedata.txt')
>>> clust = moviecluster.hcluster(data)
>>> moviecluster.printclust(clust,labels=movienames)
  Starman
Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    moviecluster.printclust(clust,labels=movienames)
  File "C:\Python26\moviecluster.py", line 101, in printclust
    if clust.right!=None: printclust(clust.right,labels=labels,n=n+1)
  File "C:\Python26\moviecluster.py", line 91, in printclust
    if clust.id<0:
AttributeError: 'list' object has no attribute 'id'
>>>

 

Portfolio Assignment 5 March 15, 2009

Filed under: Data Mining — Lauren @ 5:49 pm

Visualizations

Hans Rosling’s talk at the TED conference was so cool. The Baby Name Wizard that Professor Zacharski showed in class caught my attention. I have shown it to all my friends and typed in all of their names, which is fun. As noted in the assignment description, some visualizations are purely artistic, like this Antarctic Animation, which looks cool, but I have no idea what data it is trying to analyze!

I found an interesting baseball visualization that analyzes spending and performance. Towards the end of the 2008 season, about a month before the World Series, the number-one-ranked team, the Angels, was spending a fair amount for their performance. However, the Rays, who were ranked third behind the Angels, were spending considerably less, which counteracts the idea that the best teams spend the most. The good ole Washington Nats did terribly last season :( but also did not spend nearly as much as the Angels or other expensive teams like Boston and New York.

There are all kinds of visualizations listed on this blog by Meryl K. Evans.

I created a visualization on Many Eyes by uploading a dataset for the price of gas. It only goes to 2004 so I’d like to find a more recent version. This is my practice visualization: Price of Gas 1976-2004.

As a member of the UMW Women’s Soccer team, I am particularly proud of this visualization. I took all the team articles from the athletics website and combined them into one large document. I also added every member of the 2008 team. I uploaded it to Many Eyes and got a cool advertisement and summary of the 2008 season! UMW Women’s Soccer 2008 Review. It’s a work in progress; I plan on editing the dataset to remove words that refer to other teams or anything else that would take away from the season and team in general. Go Eagles!

 

Portfolio Assignment 4 February 17, 2009

Filed under: Data Mining — Lauren @ 12:13 am

PCI Chapter 3

Installation déjà vu. Thank goodness my Dad loves Python! I downloaded feedparser and found myself in the same pickle as with pydelicious. I called up my Dad and he said I had to add Python to the Windows Path so it could recognize the ‘python’ command. Just in case anyone else has been having this problem, here is a helpful site. I added ;C:\Python;C:\Python\Scripts to the end of the Path variable.

While trying to simply set up the dataset to cluster, I got a lot of errors from copying the code word-for-word from the book. I looked up the unofficial errata for the book, and there seem to be many problems with page 32! With help from Professor Zacharski, I got the code working, and it produced a nice output file, blogdata.txt.

The hierarchical clustering code was a little tough to follow because of all the syntax, but I understand the overall purpose of hcluster. The printclust method is a neat way to output the results. I found a search engine cluster that includes:

  • John Battelle’s Searchblog
  • The Official Google Blog
  • Search Engine Watch Blog
  • Google Operating System
  • Search Engine Roundtable
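For reference, the calls that produced that cluster (and the dendrogram below) follow the book’s clusters module; roughly:

blognames, words, data = clusters.readfile('blogdata.txt')
clust = clusters.hcluster(data)
clusters.printclust(clust, labels=blognames)
# Draw the whole tree to a jpeg (the drawdendrogram signature is from the book)
clusters.drawdendrogram(clust, blognames, jpeg='blogclust.jpg')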

Downloading and installing PIL was surprisingly easy! I got the test.jpg file to work from Appendix A! I figure understanding the code used in the PIL library is not essential for this class (more essential for a graphics class), so I just got the methods to work for the clusters and it output a pretty jpeg file of the dendrogram for the blogs.

Dendrogram generated by clusters.drawdendrogram

Switching the columns and rows in blogdata.txt and creating a new dendrogram showing word clusters took a long time because there are more words than blogs. As I’m waiting for the jpeg image to render, I’m sure the picture will be huge. Viewing the image at 3% gets the entire image into the window. I’m not going to upload this picture, but I’ll crop out an interesting word cluster I found:

A cluster from the dendrogram showing word clusters

I like the idea of k-means clustering better than hierarchical clustering because you can define the number of clusters to form beforehand. This is more ideal in the real world because it allows you to form groups based on the results you expect. I played around with changing the number of centroids. First I did k=10, like the book. Then I tried k=14 and k=4. At first I thought the clusters would contain an equal number of entries, but that wasn’t the case:

>>> [blognames[r] for r in kclust[0]]
["John Battelle's Searchblog", 'Giga Omni Media, Inc.', 'Google Operating System', 'Gawker: Valleywag', 'Gizmodo', 'Lifehacker', 'Slashdot', 'Search Engine Watch Blog', 'Schneier on Security', 'Search Engine Roundtable', 'TechCrunch', 'mezzoblue', 'Matt Cutts: Gadgets, Google, and SEO', 'The Official Google Blog', 'Bloglines | News', 'Quick Online Tips']
>>> [blognames[r] for r in kclust[3]]
['Joystiq', 'Download Squad', 'Engadget', 'Crooks and Liars', "SpikedHumor – Today's Videos and Pictures", 'The Unofficial Apple Weblog (TUAW)', 'Wired Top Stories']
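For reference, those lists came from calls roughly like this (kcluster’s signature is from the book’s clusters module):

kclust = clusters.kcluster(data, k=10)   # returns one list of row indices per cluster
print([blognames[r] for r in kclust[0]])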

I’m not familiar with most of the blogs, but in cluster 0 there are a lot of search engine blogs, which is similar to the hierarchical clustering. Below is a cluster from the Zebo dataset that I found interesting. It combines the desire to have human interaction as well as material wealth.

Clusters of things that people want

Of all the data mining techniques we have studied so far, I think clustering is the most useful tool in the real world. There are a lot of datasets that could be clustered to find useful information for marketing as well as for fun.