CPSC 470 Zacharski

Thursday, April 23, 2009

Portfolio 9

Our final group presentation was on the topic of movie clustering. The group compared the results of clustering the list of movies from IMDB based on genre and then keywords. The clustering was very memory intensive and took quite a while to complete.

Here are some of the things the group found:

First, when the group clustered the movies based on genre they were some interesting results. The groups were much broader and general. Some of the clusters made sense, while other had no correlation whatsoever. Here are three pictures of what we found after some genre clustering:

This is a snippet of the, "Scarface," genre clustering:

This is a snippet of the war movie genre clustering:

This is a snippet of the, "Last Action Hero," genre clustering:

Now, the group decided to cluster the same list of movies based on the keywords associated with them rather than their genre. These took much longer but produced much more accurate results. Here some of the results:

Here are the results of the, "Scarface," keywords clustering:

Now, the results from the war movies keywords clustering:

And finally, the results from, "Shanghai Noon," keyword clustering:

As you can see by the pictures, the keyword clustering was much more specific and more accurate. The genre clustering was very general. One conclusion that can be made is that with clustering you are faced with an age old engineering trade-off. More effort with better results or less effort with okay results.

Portfolio 8

Our presentation last week was PCI Chapter 5, Optimization. We as a group broke the chapter down and each create slides based on our assigned sections. Once completed we got together and assembled the slides. There were some interesting things that we learned.

One of the most important pieces to an optimization problem is the cost function. The cost function is also the most difficult things to determine. An optimization problem tries to minimize the cost function.

There are a few methods of optimization. The first is Random Searching. Random Searching is just what the name implies. It is just random guessing. It is only really good for using as a baseline against other algorithms.

Hill Climbing is a variation of Random Searching. It finds its closest neighbors and finds the best value. It is only able to find the local minimum rather than the global one.

Simulated Annealing uses random searching to find an initial value. The algorithm looks for progressively better values. It is much more efficient when it comes to finding the global minimum.

Genetic Algorithms start with a set of random solutions. It takes the best solutions and the rest are considered modifications of the best ones. It continues over an over until no improvement is shown.

Time must also be spent deciding on how to represent the solution. Depending on the type of problem, the solution may vary on how it is represented.

Tuesday, April 7, 2009

Portfolio 7

The first step in going through chapter 4 in PCI was to create a small set of pages that will need to be indexed. The author has provided such as list at http://kiwitobes.com/wiki. This was accomplished with the following code:

>>> import urllib2
>>> c=urllib2.urlopen('http://kiwitobes.com/wiki/Programming_language.html')
>>> contents=c.read()
>>> print contents[0:50]

The actual crawler code we are going to use, uses the Beautiful Soup API. BeautifulSoup.py was very easily downloaded from the Beautiful website. The BeautifulSoup.py was put in my working directory and:

>>> from BeautifulSoup import BeautifulSoup
>>> from urllib import urlopen
>>> soup=BeautifulSoup(urlopen('http://google.com'))
>>> soup.head.title
Google
>>> links=soup('a')
>>> len(links)
16
>>> links[0]
Images
>>> links[0].contents[0]
u'Images'

This of course just tests to make sure the BeautifulSoup.py works correctly.

Back to the crawling...I tried to find a website that would work for the crawler. I could not find a single .html website that would work. Here is an example of what I tried:

>>> pagelist=['http://rosemary.umw.edu/~stephen/cpsc330/milestone3.htm']
>>> crawler=searchengine.crawler('')
>>> crawler.crawl(pagelist)
Indexing http://rosemary.umw.edu/~stephen/cpsc330/milestone3.htm
Could not parse page http://rosemary.umw.edu/~stephen/cpsc330/milestone3.htm

I'm not sure what it is that I am doing wrong. I tried to find a simple html website and it still would not crawl.

The important thing to take away here is that the crawler is supposed to go through the website and index what it finds. Once indexed and stored the information can be queried and we can discover interesting things about that website. You can have content-based rankings where based on different criteria such as: Word Frequency, Document location, and Word distance are used to score the websites.

Once these scores are obtained it is common to normalize the scores so that they are easy to read and understand.

Tuesday, March 24, 2009

Portfolio 6

Portfolio 6 involves running hierarchical clustering on movie data from imdb.com. The python code I used I got from the PCI chapter 3 code. I took the .txt file containing all the movie data and pasted it into Excel and made some changes to the data. After that, I pasted it back into notepad and saved it as movies.txt.

I chose to cluster the movies in the .txt file by the rating they received. I may have made some kind of mistake because the dendogram was alphabetical. Even if this is the case I understand the concept behind hierarchical clustering. Which is to group the data into hierarchies and merger the two most similar. Repeating this until all data is clustered. This is how I got my dendogram.

I first ran the command:

rows,columns,data=clusters.readfile('C:\Python26\Lib\movies.txt')

Followed by:

clust=clusters.hcluster(data)

This command took quite a while to run. I started the operation and had to wait for around 30 minutes.

Hierachical clustering takes the data and bulids a hierarchy of groups. It then continuously merges the most similar groups. It repeats this until there is only one resulting group.

I then printed out the resulting dendogram:

clusters.printclust(clust,labels=rows)

This is a small section of the dendogram:

This is clearly incorrect.

I chose to try this again using different fields from the .txt file. I figured out that some fields just work better than others. I performed the same python commands as before, but this time on a file called movies4.txt. This file consisted of the movies, ratings they received, and then r1 (which I'm not sure what that means). This time I got a more convincing looking dendogram.

Here is a piece of that dendogram:

Thursday, February 26, 2009

Portfolio 5

Visualization done right

For the data visualization portfolio assignment, I chose a visualization that shows the amount of Internet country codes. The visualizations looks like this:

I chose this data visualization not because of some underlying trend that it discovers. But rather, the fact that the visualizations accomplishes its main task. It takes what would be numbers and transforms them into a picture and very understandable.

What you do see is Canada(CN) and India(IN) are the two most commonly used country codes. This is counter intuitive because most people would say the the United States should be. The .us country code is fairly popular but is not used as much because most of the United States uses other domains. Such as .com, .edu, and .org.

One surprise is that Nigeria(NG) is such a used country code. Another surprise would be how little the Japan(JP) country code is used.

This visualization does help us visualize the data, much more so than the numbers themselves.

Visualizing our own data

For my own dataset, I chose to compare Major League Baseball, American League teams. The dataset has players, teams, position, and salaries. The visualization for the data is pretty interesting. It is as follows:

This particular visualization shows teams in each pie, and the slices are the amount of salary paid to a possition. What is interesting is that most teams concentrate their funds on pitchers.

This was pretty fun to do. I had to look up the salaries of the players and make an excel spreadsheet accordingly. The formatting was pretty simple. Once I got a feel for how many-eyes wanted the data, the uploading was easy. My data could be represented by numerous visualizations. I found these pie charts the be the most visually beneficial.

The visualization could be interesting for players who were wondering what teams financial concerns were. A player could see that a team wants to pay their first basement a lot. If the player was a first basemen, he might be interested in that team. Also, if a team was scouting the opposition they could look at this visualization and see what that team's focus was.

Just for the heck of it, here is another visualization of the same data:

Wednesday, February 18, 2009

Portfolio 4

Part 1

So I have started working through the clustering program in chapter 3, and am having some difficulties. The text is pretty straightforward when it comes to content, but the code is a little confusing. Here is some of the code thus far:

import feedparser
import re

# Returns title and dictionary of the word counts for an rss feed
def getwordcounts(url):
# Parse the feed
d=feedparser.parse(url)
wc={}

# Loop over all the entries
for e in d.entries:
if 'summary' in e: summary=e.summary
else: summary=e.description

# Extract a list of words
words=getwords(e.title+' '+summary)
for word in words:
wc.setdefault(word,0)
wc[word]+=1
return d.feed.title,wc

def getwords(html):
# Remove all the HTML tags
txt=re.compile(r'<[^>]+>').sub('',html)

# Split words by all non-alpha characters
words=re.compile(r'[^A-Z^a-z]+').split(txt)

# Convert to lowercase
return [word.lower() for word in words if word!='']

apcount={}
wordcounts={}
feedlist=[]
for feedurl in file('C:\Python26\Lib\feedlist.txt'):
feedlist.add(feedurl)
title,wc=getwordcounts(feedurl)
wordcounts[title]=wc
for word, count in wc.items():
apcount.setdefault(word,0)
if count>1:
apcount[word]+=1

Up to here the code compiles fine. After that I am having issues with the file "feedlist.txt".

Part 2

This is a screen shot of the Titanic dataset. This particular graph shows the clustering of passengers who survived and those who did not. The X axis is sex of the passenger, the Y axis is the class of the passenger (1st, 2nd, 3rd, and crew), and blue is survived and red is did not survive. The data shows something that most people already know, but it is still interesting seeing the clustering of the data proving the point. That being, most men died and almost all of women 1st class passengers survived.

You can change what the X and Y axis are, to help better understand the data. The Titanic example is simple, but it still can show how visualizations can help us better understand data.

Lets look at another way we can cluster this data:

Here we compared the age and sex of the passenger, and whether or not they survived. What makes this more interesting is the fact that a fair amount of female children did not survive. The upper-right area is female child passengers, and red (did not survive) is the dominant color in that area. You can manipulate the visualizations in many ways, in order to discover interesting trends in the data set.

Part 3

For the visualizations portion of the assignment, I chose to look at a dataset in many-eyes regarding countries' fertility rate. The graph looks something like this:

Each differently colored line is a different country. There is an obvious decrease in fertility rate over the last 30+ years. So looking at this information very generally one would assume that fertility rates in every country are down. But this is not the case. Let's cut out some of the countries, and focus on some of the ones that go against the trend.

Here, we isolated Denmark and Finland. Denmark seems to start out high and the take a huge dip, and then it starts to recover. Finland, has remained steady over the last 30+ years. What is also interesting to note is that Finland and Denmark are both very close geographically. They are both in NE Europe. This being said, look at the last 10-15 years. The two countries' fertility rates seem to follow the same trend. In 1981 Denmark's fertility rate hits rock bottom at 1.8. After some research, I found that during the time period leading up to 1981, Denmark women were involved in longer enrollment in the education system. This shows us how data can give us a broad overview of many countries, or a short period in one country's history.

Tuesday, February 10, 2009

Portfolio 3

This portfolio assignment first appeared to be pretty daunting. But, after actually going the the last.fm API, it became a lot more clear. As a group we first wrote a simple recommendation program in Python. It work fine and the code was really quite simple. All the methods and classes that one would need were all in the API, and easy to user. After this success the group decided to try and do something a little more graphical. So a recommendation programm in C# was written. This program is much more flashy, and still is a bug-free as the original Python program. The group will more than likely present the C# program to the class. The Python code we wrote as a group is as follows:

import pylast
key = 'b8a9a83c0e60d30d237eea0dcdcf055a'
secret = 'a0a206e56d580a78e8715155106371fa'
sk = ''
bandName=input('Please enter band name:')

artist = pylast.Artist(bandName, key, secret, sk)
track = pylast.Track(bandName,"45", key, secret, sk)
similar = artist.get_similar()
track2 = track.get_similar()
me = pylast.User("trumpetporvida", key, secret, sk)
me.compare_with_user(

print similar

tracks = artist.get_top_tracks()

print "\n"

print tracks

print "\n"

print track2

As I said before the Python program worked just fine. Luke decided to take very similar code a design a program in C#. We chose to present this one as it was prettier to look at.

Some of the classes in the last.fm API are:

Album
Artist
Playlist
Tag

Some of the methods in the API were:

artist.search
tag.getSimilar
track.addTags
library.addAlbum

The system can take user input in various text boxes and find similar items to that input. For example, if a user inputs The Beatles for an artist name, the system will go through the last.fm dataset and return all items it found to be similar. The system can then find similar tracks by that artist based again on user input.