Lauren's Blog

stop and smell the roses

Portfolio Assignment 3 February 10, 2009

Filed under: Data Mining — Lauren @ 1:42 am

I worked on this assignment with my Team: Andrew Nelson, Kurt Koller and Will Boyd.

The first thing we did was download the Last.fm API. Kurt signed up on Last.fm to register to use the API. We decided to use Python for the project, so we downloaded and installed pylast from Google Code. Copying the files into the Python directory and then running the command import pylast in IDLE gave us no errors, so the install was successful.

To begin our recommendation system, we started simple with getting similar artists to a given artist:

>>> artist = pylast.Artist("The Black Keys", "bd46f9bce716e11a6d311d77c06d2159", "a313f7a6a587763c71eeb3cac498ca", "")

>>> artist.get_similar()

and got a large list of similar artists! Success!

Right now we are debating whether to use Python as a command-line interface or to switch to PHP to make a simple GUI application. Will has the most experience in PHP, so he has taken the lead on trying to get a simple GUI up and running. Of course, we will now have to download and install the Last.fm API for PHP. While Will works on the PHP version, Kurt, Andrew and I are working on the Python command-line version.

To create our system, we’d like to create a menu with a list of options the user can choose:

  1. Input artist, output list of similar artists: artist.get_similar()
  2. Input artist, output top tracks: artist.get_top_tracks()
  3. Input track, output list of similar songs: track.get_similar()

So, we started our basic command-line menu system. Should be pretty simple, right, especially with Python? The first issue we ran into was our if-statement. The idea was to prompt the user for a menu item and then go into the matching if-block. We created a simple test to see if we could get option 1 from above to work. The user typed in a 1 and it wouldn’t go into the corresponding if-block! WHY! We did some research online and our syntax was correct. We figured out that the raw_input() function returns a string and we were comparing it against the integer 1! Problem solved, we got option 1 to work!
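The string-versus-integer mix-up can be reproduced without any user input. The sketch below uses a hypothetical helper, matches_option, that is not part of our program; it only illustrates why comparing the text "1" with the number 1 fails, and two ways around it. In Python 3, raw_input() is called input(), but it returns text in the same way.

```python
# Hypothetical helper (not part of the team's program):
# does the text the user typed select this menu option?
def matches_option(typed_text, option_number):
    # Keyboard input always arrives as text, so turn the number into
    # text (and drop stray spaces) before comparing.
    return typed_text.strip() == str(option_number)

typed = "1"                      # what a typed 1 looks like to Python

print(typed == 1)                # False: a string never equals an integer
print(typed == "1")              # True: compare text with text
print(int(typed) == 1)           # True: or convert the text to a number first
print(matches_option(" 1 ", 1))  # True: surrounding spaces are ignored
```

Comparing against "1" keeps the menu simple; converting with int() is only worth it if the number is needed for arithmetic, and it raises a ValueError when the user types something that is not a number.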

Option 2 is not working as smoothly as the first. We keep getting errors from the method get_top_tracks(). Let me just say that the Last.fm API is terrible! The syntax is wrong and the parameter values are off. The API for get_top_tracks() says it takes two arguments, but when we pass them in our program, we get an error saying it only takes one argument.
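For reference, Python reports a wrong number of arguments when the call runs, as a TypeError rather than a syntax error. The sketch below shows that behaviour with a made-up class, FakeArtist, whose method accepts only self. It is purely an illustration and an assumption about the shape of the problem: it is not pylast's code, and it does not tell us what signature pylast's get_top_tracks() really has.

```python
# Hypothetical stand-in for a wrapper class. This is NOT pylast's real code.
class FakeArtist(object):
    def __init__(self, name):
        self.name = name

    # Only "self" is accepted, so callers may not pass anything extra.
    def get_top_tracks(self):
        return ["%s track %d" % (self.name, n) for n in (1, 2, 3)]

artist = FakeArtist("The Black Keys")
print(artist.get_top_tracks())          # works: no extra arguments

try:
    artist.get_top_tracks("overall")    # one argument too many
except TypeError as error:
    # The mismatch is caught at run time and described in the message.
    print("TypeError: %s" % error)
```

When a wrapper's documentation and its behaviour disagree like this, calling help() on the method in IDLE shows the signature the installed version actually expects.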

Code so far:

import pylast

# Last.fm credentials, shared by every request below
API_KEY = "bd46f9bce716e11a6d311d77c06d2159"
API_SECRET = "a313f7a6a587763c71eeb3cac498ca40"

print "Welcome to the Team 3 Pylast Recommender!"
print "To find an artist similar to your artist, press (1)"
print "To find a list of top tracks by an artist, press (2)"
print "To find a list of songs similar to a particular song, press (3)"
print "To quit, press (4)"

# raw_input() returns a string, so the choice is compared with "1", not 1
choice = raw_input(">")
print choice

if choice == "1":
    print "Please enter the name of an Artist"
    similarArtistName = raw_input(">")
    # Pass the variable itself, not the quoted text "similarArtistName"
    similarArtist = pylast.Artist(similarArtistName, API_KEY, API_SECRET, '')
    print similarArtist.get_similar()
elif choice == "2":
    print "Please enter the name of an Artist"
    trackArtistName = raw_input(">")
    trackArtist = pylast.Artist(trackArtistName, API_KEY, API_SECRET, '')
    # Option 2 is the call that is still giving us errors
    print trackArtist.get_top_tracks()

Will demoed a great PHP recommendation system in class. He used these simple Last.fm API methods: Artist.get_similar(), Track.get_similar(), Geo.get_top_artist() and Geo.get_top_tracks(). Check it out here!

 

Portfolio Assignment 2 February 4, 2009

Filed under: Data Mining — Lauren @ 7:57 am

Python

I have had a very difficult time trying to get pydelicious to work. I’m using a Windows XP machine and have done multiple Google searches on “how to install pydelicious”. I find every entry confusing and not very helpful. I’ve even looked at blog entries from classmates and couldn’t follow their installations, whether they were on a different operating system or not. I’m going to get together with a member of my team to work through it.

Weka Part 1

Although the sample dataset is included when Weka is downloaded, I wanted to go through the process of creating an ARFF file from Microsoft Excel. I created the weather dataset in Excel and saved it as a CSV file. Then I opened it in Notepad to add the ARFF file tags: @relation, @attribute and @data.
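The same CSV-to-ARFF step can also be scripted instead of done by hand in Notepad. Here is a minimal sketch, written for Python 3, of a hypothetical helper called csv_to_arff that treats every column as a nominal attribute. The column names and the four rows are illustrative assumptions in the style of the weather dataset, not a copy of my actual file.

```python
import csv
import io

def csv_to_arff(csv_text, relation):
    """Convert CSV text (header row first) into ARFF text, all attributes nominal."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    lines = ["@relation " + relation, ""]
    for column, name in enumerate(header):
        # Collect this column's distinct values in order of first appearance.
        values = []
        for row in data:
            if row[column] not in values:
                values.append(row[column])
        lines.append("@attribute %s {%s}" % (name, ", ".join(values)))

    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines)

# Illustrative rows only (assumed, not the real spreadsheet).
sample_csv = """outlook,temperature,humidity,windy,play
sunny,hot,high,FALSE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,TRUE,no
"""

print(csv_to_arff(sample_csv, "weather"))
```

Numeric columns would need "@attribute name numeric" instead of a value list, which this sketch does not attempt to detect.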

Weka Part 2

After downloading the Cleveland Heart Disease dataset, I used the preprocess tab in the Weka Explorer to load the ARFF file. I ran the J48 decision tree from the classify tab and got this performance summary:

=== Summary ===

Correctly Classified Instances         235               77.5578 %
Incorrectly Classified Instances        68               22.4422 %
Kappa statistic                          0.5443
Mean absolute error                      0.1044
Root mean squared error                  0.2725
Relative absolute error                 52.0476 %
Root relative squared error             86.5075 %
Total Number of Instances              303
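The two percentages in the summary follow directly from the instance counts. A quick check, using only the numbers Weka printed:

```python
correct = 235
incorrect = 68
total = correct + incorrect      # 303, matching "Total Number of Instances"

accuracy = 100.0 * correct / total
error_rate = 100.0 * incorrect / total

print("Correctly classified:   %.4f %%" % accuracy)
print("Incorrectly classified: %.4f %%" % error_rate)
```

The other figures (kappa and the error measures) depend on the per-instance predictions, so they cannot be recomputed from these counts alone.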

 

Portfolio Assignment 1 January 25, 2009

Filed under: Data Mining — Lauren @ 12:44 am

After creating recommendations.py and running the commands on page 9 of “Collective Intelligence”, I got an error about recommendations not existing. I then re-read the page and moved recommendations.py to the Lib directory in Python. That fixed it right away. I love how easy Python makes it to use data structures like dictionaries and lists!

Euclidean Distance

Plugging the Euclidean distance formula right into the Python interpreter (using IDLE) gave me the same answers as the example in the book with Toby and LaSalle. However, when I added the function sim_distance to recommendations.py, I got a different answer for Lisa Rose and Gene Seymour. I added the squares of the differences by hand and got the same answer as my function. I think the general consensus is the book is wrong!
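One thing worth checking when a function and a worked example disagree is whether the square root is taken before the distance is turned into a similarity score. The sketch below computes both variants side by side. The ratings are typed in from memory of the book's critics dictionary, so treat them as an assumption and check them against your own recommendations.py. The helper name sum_of_squares is mine, and this post does not confirm which variant the book's numbers come from.

```python
from math import sqrt

# Ratings recalled from the book's "critics" data (an assumption; verify
# against your own copy before relying on the numbers).
critics = {
    'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
                  'Just My Luck': 3.0, 'Superman Returns': 3.5,
                  'You, Me and Dupree': 2.5, 'The Night Listener': 3.0},
    'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
                     'Just My Luck': 1.5, 'Superman Returns': 5.0,
                     'The Night Listener': 3.0, 'You, Me and Dupree': 3.5},
}

def sum_of_squares(prefs, person_a, person_b):
    """Sum of squared rating differences over the items both people rated."""
    shared = [item for item in prefs[person_a] if item in prefs[person_b]]
    return sum((prefs[person_a][item] - prefs[person_b][item]) ** 2
               for item in shared)

total = sum_of_squares(critics, 'Lisa Rose', 'Gene Seymour')

# Two ways to turn the distance into a similarity between 0 and 1.
without_root = 1 / (1 + total)         # skips the square root
with_root = 1 / (1 + sqrt(total))      # uses the true Euclidean distance

print(total)
print(round(without_root, 4), round(with_root, 4))
```

With these ratings the two variants give noticeably different scores, which is the kind of gap that could explain a mismatch between a function and a hand-worked example.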

Pearson Coefficient

The Pearson coefficient worked correctly and yielded the same results as the book. It took me a while to understand how the function sim_pearson matched the formula we discussed in class, but I worked through it.
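For anyone else trying to line the code up with the formula, here is a minimal sketch of the standard sums-based form of the Pearson coefficient, r = (Σxy - ΣxΣy/n) / sqrt((Σx² - (Σx)²/n)(Σy² - (Σy)²/n)). I am assuming this is the form we covered in class. The function name pearson and the sample lists are hypothetical, and it works on two plain lists rather than on the book's prefs dictionary.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    if n == 0:
        return 0.0

    # The five sums that appear in the formula.
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x_sq = sum(x * x for x in xs)
    sum_y_sq = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Numerator: how the lists vary together.
    numerator = sum_xy - (sum_x * sum_y / float(n))
    # Denominator: how much each list varies on its own.
    denominator = sqrt((sum_x_sq - sum_x ** 2 / float(n)) *
                       (sum_y_sq - sum_y ** 2 / float(n)))
    if denominator == 0:
        return 0.0   # one list never varies, so r is undefined; report 0
    return numerator / denominator

# Hypothetical ratings chosen so the answers are easy to check by hand.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # lists rise together
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))   # one rises as the other falls
```

Unlike the distance measures, the result does not change if one person consistently rates everything higher than the other, since only the pattern of agreement matters.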

Manhattan Distance

Implementing the Manhattan distance was pretty simple. I followed the same format as the sim_distance and sim_pearson functions. The formula for the Manhattan distance is |X1-X2|+|Y1-Y2|+…+|Z1-Z2|. I had to look up the syntax for an absolute value function in Python and it was what I thought it would be: abs(x). Below is my sim_manhattan function.

# Returns the Manhattan distance between personA and personB
# (a smaller value means more similar tastes)
def sim_manhattan(prefs, personA, personB):
    # Get the list of shared_items
    si={}
    for item in prefs[personA]:
        if item in prefs[personB]:
            si[item]=1

    # if they have no ratings in common, return 0
    if len(si)==0: return 0

    # Add up the absolute values of all the differences
    sum_of_abs=sum([abs(prefs[personA][item]-prefs[personB][item]) for item in si])
    return sum_of_abs

When I tested it in the Python interpreter with the critics Lisa Rose and Gene Seymour, I got the following, correct result:

>>> reload(recommendations)
<module 'recommendations' from 'C:\Python26\lib\recommendations.py'>
>>> recommendations.sim_manhattan(recommendations.critics, 'Lisa Rose', 'Gene Seymour')
4.5

 

CPSC 470: Data Mining January 23, 2009

Filed under: Data Mining — Lauren @ 5:51 pm

Spring 2009

A hands-on introductory course on data mining and information retrieval.

http://www.zacharski.org/classes/2009/spring/cs470u/index.php