PCI Chapter 3
Installation déjà vu. Thank goodness my Dad loves Python! I downloaded feedparser and found myself in the same pickle as with pydelicious. I called up my Dad and he said I had to add Python to the Windows path so the command prompt can recognize the 'python' command. Just in case anyone else has been having this problem, here is a helpful site. I added ;C:\Python;C:\Python\Scripts to the PATH variable.
While trying to simply set up the dataset to cluster, I got a lot of errors even though I copied the code word for word from the book. I looked up the unofficial errata for the book, and there seem to be many problems with page 32! With help from Professor Zacharski, I got the code working, and it produced a nice output file, blogdata.txt.
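Once feedparser can pull down each feed, the word counting itself is pretty simple. Here's a stdlib-only sketch of the word-splitting step (getwords is my name for it; the book's generatefeedvector.py does something like this per feed entry, and the real script also filters out words that appear in too many or too few blogs):

```python
import re

def getwords(html):
    """Strip HTML tags, then split the remaining text into lowercase words."""
    text = re.sub(r'<[^>]+>', ' ', html)      # remove tags like <p> and </a>
    return re.findall(r'[a-z]+', text.lower())  # keep alphabetic runs only
```

Each blog's word counts then become one row of blogdata.txt, with one column per word.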
The hierarchical clustering code was a little tough to follow because of all the syntax, but I understand the overall purpose of hcluster. The printclust method is a neat way to output the results. I found a search engine cluster that includes:
- John Battelle’s Searchblog
- The Official Google Blog
- Search Engine Watch Blog
- Google Operating System
- Search Engine Roundtable
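For anyone else lost in the syntax, the overall idea of hcluster is simple: start with every blog as its own cluster and repeatedly merge the two closest ones until a single tree remains. Here's a stdlib-only sketch of that idea (the class and function names here are mine, not the book's exact code):

```python
import math

class BiCluster:
    """A node in the binary cluster tree; leaves have id >= 0."""
    def __init__(self, vec, left=None, right=None, distance=0.0, id=None):
        self.vec = vec
        self.left = left
        self.right = right
        self.distance = distance
        self.id = id

def pearson_distance(v1, v2):
    """1 - Pearson correlation: 0.0 for perfectly correlated vectors."""
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1sq = sum(x * x for x in v1)
    sum2sq = sum(x * x for x in v2)
    psum = sum(x * y for x, y in zip(v1, v2))
    num = psum - (sum1 * sum2 / n)
    den = math.sqrt((sum1sq - sum1 ** 2 / n) * (sum2sq - sum2 ** 2 / n))
    return 1.0 - (num / den if den else 0.0)

def hcluster(rows):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [BiCluster(row, id=i) for i, row in enumerate(rows)]
    next_id = -1  # negative ids mark merged (internal) nodes
    while len(clusters) > 1:
        # find the closest pair of clusters
        best = (0, 1)
        best_d = pearson_distance(clusters[0].vec, clusters[1].vec)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = pearson_distance(clusters[i].vec, clusters[j].vec)
                if d < best_d:
                    best_d, best = d, (i, j)
        a, b = clusters[best[0]], clusters[best[1]]
        # the merged cluster's vector is the average of its two children
        merged = BiCluster(
            [(x + y) / 2 for x, y in zip(a.vec, b.vec)],
            left=a, right=b, distance=best_d, id=next_id)
        next_id -= 1
        clusters = [c for k, c in enumerate(clusters) if k not in best]
        clusters.append(merged)
    return clusters[0]
```

printclust then just walks this tree recursively, indenting by depth, which is why the output reads like a nested outline.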
Downloading and installing PIL was surprisingly easy! I got the test.jpg file to work from Appendix A! I figure understanding the code inside the PIL library itself is not essential for this class (it's more relevant to a graphics class), so I just got the methods working for the clusters, and they output a pretty jpeg file of the dendrogram for the blogs.
Switching the columns and rows in blogdata.txt and creating a new dendrogram showing word clusters took a long time to generate because there are far more words than blogs, so there are many more rows to cluster. As I waited for the jpeg to render, I figured the picture would be huge, and it was: I have to view the image at 3% zoom to fit the whole thing in the window. I'm not going to upload that picture, but I'll crop out an interesting word cluster I found:
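The transpose itself is the easy part; the slowness comes from hcluster comparing every pair of rows, so a file with one row per word means vastly more distance calculations than one row per blog. A one-liner for the row/column swap (rotate_matrix is a hypothetical name for what the book's rotation helper does):

```python
def rotate_matrix(data):
    """Transpose a row-major matrix so columns become rows."""
    return [list(col) for col in zip(*data)]
```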
I like the idea of k-means clustering better than hierarchical clustering because you can define the number of clusters beforehand. This is often more practical in the real world because it lets you form groups based on how many you expect the results to fall into. I played around with changing the number of centroids: first k=10, like the book, then k=14 and k=4. At first I thought the clusters would each contain an equal number of entries, but that wasn't the case:
>>> [blognames[r] for r in kclust[0]]
["John Battelle's Searchblog", 'Giga Omni Media, Inc.', 'Google Operating System', 'Gawker: Valleywag', 'Gizmodo', 'Lifehacker', 'Slashdot', 'Search Engine Watch Blog', 'Schneier on Security', 'Search Engine Roundtable', 'TechCrunch', 'mezzoblue', 'Matt Cutts: Gadgets, Google, and SEO', 'The Official Google Blog', 'Bloglines | News', 'Quick Online Tips']
>>> [blognames[r] for r in kclust[1]]
['Joystiq', 'Download Squad', 'Engadget', 'Crooks and Liars', "SpikedHumor – Today's Videos and Pictures", 'The Unofficial Apple Weblog (TUAW)', 'Wired Top Stories']
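The uneven sizes make sense given how k-means works: each blog just joins its nearest centroid, and nothing forces the centroids to split the data evenly. A rough sketch of the loop (using Euclidean distance for simplicity, where the book uses Pearson; kcluster's parameter names here are my own):

```python
import math
import random

def euclidean(v1, v2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def kcluster(rows, k=4, max_iter=100, seed=0):
    """Basic k-means: assign rows to nearest centroid, then move centroids."""
    rng = random.Random(seed)
    n = len(rows[0])
    # start centroids at random points within each column's observed range
    ranges = [(min(r[i] for r in rows), max(r[i] for r in rows)) for i in range(n)]
    centroids = [[rng.uniform(lo, hi) for lo, hi in ranges] for _ in range(k)]
    last = None
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for j, row in enumerate(rows):
            nearest = min(range(k), key=lambda ci: euclidean(centroids[ci], row))
            clusters[nearest].append(j)
        if clusters == last:  # assignments stopped changing: converged
            break
        last = clusters
        # move each centroid to the mean of its assigned rows
        for c in range(k):
            if clusters[c]:
                centroids[c] = [sum(rows[j][i] for j in clusters[c]) / len(clusters[c])
                                for i in range(n)]
    return clusters
```

Because the starting centroids are random, different runs (or different k values) can give quite different groupings, which matches what I saw when re-running it.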
I'm not familiar with most of the blogs, but cluster 0 contains a lot of search engine blogs, which is similar to the hierarchical clustering results. Below is a cluster from the Zebo dataset that I found interesting: it combines the desire for human interaction with the desire for material wealth.
Of all the data mining techniques we have studied so far, I think clustering is the most useful tool in the real world. There are a lot of datasets that could be clustered to find useful information for marketing as well as for fun.