Lauren's Blog

stop and smell the roses

Portfolio Assignment 4 February 17, 2009

Filed under: Data Mining — Lauren @ 12:13 am

PCI Chapter 3

Installation déjà vu. Thank goodness my Dad loves Python! I downloaded feedparser and found myself in the same pickle as with pydelicious. I called up my Dad and he said I have to add Python to the Windows path so it can recognize the ‘python’ command. Just in case anyone else has been having this problem, here is a helpful site. I added ;C:\Python;C:\Python\Scripts to the end of the Path variable.
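For anyone else on Windows who only needs the change for the current Command Prompt session, something like this works (to make it permanent, use System Properties → Environment Variables instead):

```
set PATH=%PATH%;C:\Python;C:\Python\Scripts
```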

While trying to simply set up the dataset to cluster, I got a lot of errors even though I copied the code word-for-word from the book. I looked up the unofficial errata for the book, and there seem to be many problems with page 32! With help from Professor Zacharski, I got the code working, and it produced a nice output file, blogdata.txt.
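The overall shape of that dataset-building step is easy to sketch without the feed-parsing details. This is not the book's exact code, just a minimal illustration of the idea: count words per blog, then keep only words that appear in a middling fraction of the blogs (the feedparser download step is left out, so the input here is simply a dict of blog text):

```python
import re

def getwords(html):
    # Strip tags, then split on non-letters and lowercase everything
    text = re.sub(r'<[^>]+>', '', html)
    return [w.lower() for w in re.split(r'[^A-Za-z]+', text) if w]

def make_blogdata(blog_texts, minfrac=0.1, maxfrac=0.5):
    # blog_texts maps blog name -> concatenated entry text for that blog
    wordcounts = {}
    appearances = {}  # in how many blogs each word shows up
    for blog, text in blog_texts.items():
        counts = {}
        for w in getwords(text):
            counts[w] = counts.get(w, 0) + 1
        wordcounts[blog] = counts
        for w in counts:
            appearances[w] = appearances.get(w, 0) + 1
    # Keep words that appear in a middling fraction of blogs: too rare is
    # noise, too common is a stopword like "the"
    n = len(blog_texts)
    wordlist = [w for w, c in appearances.items() if minfrac <= c / n <= maxfrac]
    rows = {b: [wordcounts[b].get(w, 0) for w in wordlist] for b in blog_texts}
    return wordlist, rows
```

Each row of the result is one line of blogdata.txt: a blog name followed by one count per surviving word.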

The hierarchical clustering code was a little tough to follow because of all the nested syntax, but I understand the overall purpose of hcluster. The printclust method is a neat way to output the results. I found a search engine cluster that includes:

  • John Battelle’s Searchblog
  • The Official Google Blog
  • Search Engine Watch Blog
  • Google Operating System
  • Search Engine Roundtable
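The overall purpose of hcluster can be sketched in a few lines. This is a simplified stand-in, not the book's exact code: start with every blog as its own cluster, repeatedly merge the closest pair under Pearson distance, and stop when a single tree remains:

```python
def pearson(v1, v2):
    # Pearson distance: 0.0 for perfectly correlated vectors, larger = less similar
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1sq = sum(x * x for x in v1)
    sum2sq = sum(x * x for x in v2)
    psum = sum(a * b for a, b in zip(v1, v2))
    num = psum - sum1 * sum2 / n
    den = ((sum1sq - sum1 ** 2 / n) * (sum2sq - sum2 ** 2 / n)) ** 0.5
    return 1.0 if den == 0 else 1.0 - num / den

class Cluster:
    def __init__(self, vec, left=None, right=None, id=None):
        self.vec, self.left, self.right, self.id = vec, left, right, id

def hcluster(rows):
    # Every row starts as its own cluster (id = row index); merged nodes get id None
    clusters = [Cluster(row, id=i) for i, row in enumerate(rows)]
    while len(clusters) > 1:
        # Find the closest pair of clusters
        best, bestd = (0, 1), pearson(clusters[0].vec, clusters[1].vec)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = pearson(clusters[i].vec, clusters[j].vec)
                if d < bestd:
                    best, bestd = (i, j), d
        i, j = best
        # Merge them; the new node's vector is the average of its children
        merged = Cluster([(a + b) / 2 for a, b in zip(clusters[i].vec, clusters[j].vec)],
                         left=clusters[i], right=clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in best] + [merged]
    return clusters[0]
```

printclust is then just a recursive walk of this tree, indenting by depth.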

Downloading and installing PIL was surprisingly easy! I got the test.jpg file from Appendix A to work! I figure understanding the code inside the PIL library is not essential for this class (it is more essential for a graphics class), so I just got the methods working for the clusters, and they output a pretty jpeg of the dendrogram for the blogs.

Dendrogram generated by clusters.drawdendrogram

Switching the columns and rows in blogdata.txt and creating a new dendrogram of word clusters took a long time to generate because there are many more words than blogs. While waiting for the jpeg to render, I suspected the image would be huge, and it was: I had to zoom out to 3% to fit the entire image in the window. I'm not going to upload that picture, but here is a crop of an interesting word cluster I found:

A cluster from the dendrogram showing word clusters
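The row/column switch itself is a one-liner; something like this transpose (an illustrative version, not necessarily the book's exact rotate function) is all that's needed before re-running hcluster on words instead of blogs:

```python
def rotatematrix(data):
    # Turn the blogs-by-words matrix into words-by-blogs:
    # row i, column j of the result is row j, column i of the input
    return [[data[i][j] for i in range(len(data))] for j in range(len(data[0]))]
```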

I like the idea of k-means clustering better than hierarchical clustering because you can define the number of clusters to form beforehand. That is more practical in the real world, since it lets you shape the groups around the number of results you expect. I played around with changing the number of centroids: first I did k=10, like the book, then I tried k=14 and k=4. At first I thought the clusters would contain an equal number of entries, but that wasn't the case:

>>> [blognames[r] for r in kclust[0]]

["John Battelle's Searchblog", 'Giga Omni Media, Inc.', 'Google Operating System', 'Gawker: Valleywag', 'Gizmodo', 'Lifehacker', 'Slashdot', 'Search Engine Watch Blog', 'Schneier on Security', 'Search Engine Roundtable', 'TechCrunch', 'mezzoblue', 'Matt Cutts: Gadgets, Google, and SEO', 'The Official Google Blog', 'Bloglines | News', 'Quick Online Tips']

>>> [blognames[r] for r in kclust[3]]

['Joystiq', 'Download Squad', 'Engadget', 'Crooks and Liars', "SpikedHumor – Today's Videos and Pictures", 'The Unofficial Apple Weblog (TUAW)', 'Wired Top Stories']
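The k-means loop behind kclust is worth sketching too. This is a simplified version, not the book's exact code, and the init parameter for fixed starting centroids is my own addition so the example is repeatable (the book picks random starting points):

```python
import random

def kcluster(rows, distance, k=4, init=None, maxiter=100):
    n = len(rows[0])
    if init is not None:
        # Fixed starting centroids, for repeatable runs
        centroids = [list(c) for c in init]
    else:
        # Random starting points inside each column's observed range
        ranges = [(min(r[i] for r in rows), max(r[i] for r in rows)) for i in range(n)]
        centroids = [[random.uniform(lo, hi) for lo, hi in ranges] for _ in range(k)]
    lastmatches = None
    for _ in range(maxiter):
        # Assign every row to its nearest centroid
        matches = [[] for _ in range(k)]
        for j, row in enumerate(rows):
            nearest = min(range(k), key=lambda i: distance(centroids[i], row))
            matches[nearest].append(j)
        if matches == lastmatches:
            break  # assignments stopped changing: converged
        lastmatches = matches
        # Move each centroid to the mean of the rows assigned to it
        for i in range(k):
            if matches[i]:
                centroids[i] = [sum(rows[j][d] for j in matches[i]) / len(matches[i])
                                for d in range(n)]
    return matches
```

Nothing in the loop forces the clusters to be the same size, which is why kclust[0] and kclust[3] above came out so different.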

I'm not familiar with most of the blogs, but cluster 0 contains a lot of search engine blogs, which is similar to the hierarchical clustering result. Below is a cluster from the Zebo dataset that I found interesting: it combines the desire for human interaction with the desire for material wealth.

Clusters of things that people want

    Of all the data mining techniques we have studied so far, I think clustering is the most useful tool in the real world. There are a lot of datasets that could be clustered to find useful information for marketing as well as for fun.

     

    Portfolio Assignment 3 February 10, 2009

    Filed under: Data Mining — Lauren @ 1:42 am

    I worked on this assignment with my Team: Andrew Nelson, Kurt Koller and Will Boyd.

The first thing we did was download the Last.fm API. Kurt signed up on Last.fm to register to use the API. We decided to use Python for the project, so we downloaded and installed pylast from Google Code. Copying the files into the Python directory and then running import pylast in IDLE gave us no errors, so the install was successful.

    To begin our recommendation system, we started simple with getting similar artists to a given artist:

>>> artist = pylast.Artist("The Black Keys", "bd46f9bce716e11a6d311d77c06d2159", "a313f7a6a587763c71eeb3cac498ca40", '')

    >>> artist.get_similar()

    and got a large list of similar artists! Success!

Right now we are debating whether to keep Python as a command-line interface or to switch to PHP to make a simple GUI application. Will has the most experience in PHP, so he has taken the lead on getting a simple GUI up and running. Of course, we will now have to download and install the Last.fm API bindings for PHP. While Will works on the PHP version, Kurt, Andrew, and I are working on the Python command line.

    To create our system, we’d like to create a menu with a list of options the user can choose:

    1. Input artist, output list of similar artists: artist.get_similar()
    2. Input artist, output top tracks: artist.get_top_tracks()
    3. Input track, output list of similar songs: track.getSimilar()

So, we started our basic command-line menu system. Should be pretty simple, right? Especially with Python. The first issue we ran into was our if-statement: prompt the user for a menu item, then go into that if-block. We created a simple test to see if we could get option 1 from above to work. The user typed in a 1 and it wouldn't go into the corresponding if-block! WHY! We did some research online and our syntax was correct. We finally figured out that the raw_input() method returns a string, and we were testing the value 1 as an integer! Problem solved, and option 1 works!
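A tiny snippet shows the bug. (Python 2's raw_input() and Python 3's input() both return strings; the example below uses Python 3's print() syntax.)

```python
# What raw_input(">") returns when the user types 1 is the *string* "1"
choice = "1"

print(choice == 1)       # False -- a str never equals an int, so the if-block is skipped
print(choice == "1")     # True  -- compare string against string...
print(int(choice) == 1)  # True  -- ...or convert the input first
```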

    Option 2 is not working as smoothly as the first. We keep getting errors for the method get_top_tracks(). Let me just say that the last.fm API is terrible! The syntax is wrong and the parameter values are off. The API for get_top_tracks() says it takes two arguments but when we input them into our program, we get a syntax error saying it only takes one parameter.

    Code so far:

    import pylast
    
    print "Welcome to the Team 3 Pylast Recommender!"
    print "To find an artist similar to your artist, press (1)"
    print "To find a list of top tracks by an artist, press (2)"
    print "To find a list of songs similar to a particular song, press (3)"
    print "To quit, press (4)"
    
    input = raw_input(">")
    print input
    
    if (input=="1"):
        print "Please enter the name of an Artist"
        similarArtistName = raw_input(">")
        # pass the variable itself, not the literal string "similarArtistName"
        similarArtist = pylast.Artist(similarArtistName, "bd46f9bce716e11a6d311d77c06d2159", "a313f7a6a587763c71eeb3cac498ca40", '')
        print similarArtist.get_similar()
    elif (input=="2"):
        print "Please enter the name of an Artist"
        trackArtistName = raw_input(">")
        trackArtist = pylast.Artist(trackArtistName, "bd46f9bce716e11a6d311d77c06d2159", "a313f7a6a587763c71eeb3cac498ca40", '')
        # same call as pylast.Artist.get_top_tracks(trackArtist), written as a method
        print trackArtist.get_top_tracks()

Will demoed a great PHP recommendation system in class. He used these simple Last.fm API methods: Artist.get_similar(), Track.get_similar(), Geo.get_top_artist() and Geo.get_top_tracks(). Check it out here!

     

    Portfolio Assignment 2 February 4, 2009

    Filed under: Data Mining — Lauren @ 7:57 am

    Python

I have had a very difficult time trying to get pydelicious to work. I'm using a Windows XP machine and have done multiple Google searches on “how to install pydelicious”. I find every entry confusing and not very helpful. I've even looked at blog entries from classmates and couldn't follow their installations, whether they were on a different operating system or not. I'm going to get together with a member of my team to work through it.

    Weka Part 1

Although the sample dataset is included when Weka is downloaded, I wanted to go through the process of creating an ARFF file from Microsoft Excel. I created the weather dataset in Excel and saved it as a CSV file. Then I opened it in Notepad to add the ARFF file tags: @relation, @attribute and @data.
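For reference, the start of the file looks roughly like this once the tags are added. The attribute names and first rows here are from the standard Weka weather dataset; the exact values depend on what you typed into Excel:

```
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
```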

    Weka Part 2

    After downloading the Cleveland Heart Disease dataset, I used the preprocess tab in the Weka Explorer to load the ARFF file. I ran the J48 decision tree from the classify tab and got this performance summary:

=== Summary ===

Correctly Classified Instances         235               77.5578 %
Incorrectly Classified Instances        68               22.4422 %
Kappa statistic                          0.5443
Mean absolute error                      0.1044
Root mean squared error                  0.2725
Relative absolute error                 52.0476 %
Root relative squared error             86.5075 %
Total Number of Instances              303
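The same J48 run can also be reproduced outside the Explorer GUI. Assuming weka.jar is on the classpath and the dataset is saved as cleveland.arff (a filename I'm using for illustration), Weka's command line evaluates with 10-fold cross-validation by default:

```
java -cp weka.jar weka.classifiers.trees.J48 -t cleveland.arff
```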