Part 1
So I have started working through the clustering program in chapter 3, and am having some difficulties. The text is pretty straightforward when it comes to content, but the code is a little confusing. Here is some of the code thus far:
import feedparser
import re

# Returns title and dictionary of the word counts for an RSS feed
def getwordcounts(url):
    # Parse the feed
    d=feedparser.parse(url)
    wc={}
    # Loop over all the entries
    for e in d.entries:
        if 'summary' in e: summary=e.summary
        else: summary=e.description
        # Extract a list of words
        words=getwords(e.title+' '+summary)
        for word in words:
            wc.setdefault(word,0)
            wc[word]+=1
    return d.feed.title,wc

def getwords(html):
    # Remove all the HTML tags
    txt=re.compile(r'<[^>]+>').sub('',html)
    # Split words by all non-alpha characters
    words=re.compile(r'[^A-Za-z]+').split(txt)
    # Convert to lowercase
    return [word.lower() for word in words if word!='']

# apcount: number of blogs in which each word appears more than once
apcount={}
# wordcounts: per-blog word counts, keyed by feed title
wordcounts={}
feedlist=[]
# Raw string, so that '\f' in the path is not read as a form-feed escape
for feedurl in open(r'C:\Python26\Lib\feedlist.txt'):
    # Drop the trailing newline and skip blank lines
    feedurl=feedurl.strip()
    if feedurl=='': continue
    # Lists have append(), not add()
    feedlist.append(feedurl)
    title,wc=getwordcounts(feedurl)
    wordcounts[title]=wc
    for word,count in wc.items():
        apcount.setdefault(word,0)
        if count>1:
            apcount[word]+=1
Up to here the code compiles fine. After that I am having issues with the file "feedlist.txt".
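One possible cause of the file trouble, offered as a guess since the error message is not shown: in an ordinary Python string literal, `\f` is the form-feed escape, so a Windows path containing `\feedlist.txt` does not hold the characters it appears to. Lines read from a file also keep their trailing newline. The sketch below shows both points. `read_feed_urls` is a hypothetical helper, not part of the book's code, and the URLs are made-up placeholders.

```python
import os
import tempfile

# '\f' is a valid escape (form feed), so the plain literal silently loses
# the backslash and the letter 'f'; the raw literal keeps both.
plain = 'Lib\feedlist.txt'
raw = r'Lib\feedlist.txt'
print(len(plain), len(raw))  # the raw version is one character longer


def read_feed_urls(path):
    """Return the non-empty lines of a text file, without trailing newlines.

    Hypothetical helper for illustration only.
    """
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


# Demonstration with a throwaway file and placeholder URLs.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('http://example.com/feed1.xml\n\nhttp://example.com/feed2.xml\n')
    tmp_path = tmp.name

urls = read_feed_urls(tmp_path)
os.remove(tmp_path)
print(urls)
```

Writing the path as a raw string (`r'...'`) or with forward slashes avoids the escape problem, and stripping each line keeps a stray newline from being passed along as part of the feed URL.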
Part 2
This is a screenshot of the Titanic dataset. This particular graph shows the clustering of passengers who survived and those who did not. The X axis is the sex of the passenger, the Y axis is the class of the passenger (1st, 2nd, 3rd, and crew), and blue means survived while red means did not survive. The data shows something that most people already know, but it is still interesting to see the clustering prove the point: most men died, and almost all female 1st class passengers survived.

You can change what the X and Y axes are to help better understand the data. The Titanic example is simple, but it still shows how visualizations can help us better understand data.
Let's look at another way we can cluster this data:

Here we compared the age and sex of the passengers, and whether or not they survived. What makes this more interesting is that a fair number of female children did not survive. The upper-right area is female child passengers, and red (did not survive) is the dominant color in that area. You can manipulate the visualizations in many ways in order to discover interesting trends in the data set.
Part 3
For the visualizations portion of the assignment, I chose to look at a dataset in many-eyes regarding countries' fertility rates. The graph looks something like this:

Each differently colored line is a different country. There is an obvious decrease in fertility rate over the last 30+ years. Looking at this information very generally, one would assume that fertility rates are down in every country, but this is not the case. Let's cut out some of the countries and focus on some of the ones that go against the trend.

Here, we isolated Denmark and Finland. Denmark seems to start out high, then take a huge dip, and then start to recover. Finland has remained steady over the last 30+ years. What is also interesting to note is that Finland and Denmark are very close geographically: both are in northern Europe. With that in mind, look at the last 10-15 years. The two countries' fertility rates seem to follow the same trend. In 1981 Denmark's fertility rate hit rock bottom at 1.8. After some research, I found that during the period leading up to 1981, Danish women were staying enrolled in the education system longer. This shows us how data can give us a broad overview of many countries, or a close look at a short period in one country's history.