I’ve been following the development of googleVis, the implementation of the Google Visualization API for R, for a bit now. The library has a lot of potential as a bridge between R (where data processing happens) and HTML (where presentation is [increasingly] happening). A growing number of visualization frameworks are on the market and all have their perks (e.g. Many Eyes, Simile, Flare). I guess I was so inspired by the Hans Rosling TED talk, which makes such great use of bubble charts, that I wanted to try the Google Vis API for that chart type alone. There’s more, however, if you don’t care much for floating bubbles: neat chart variants include the geochart, area charts and the usual classics (bar, pie, etc.). Check out the chart gallery for an overview.

So here are my internet growth charts:

(1) Motion chart showing the growth of the global internet population since 2000 for 208 countries

(2) World map showing global internet user statistics for 2009 for 208 countries

Data source: data.un.org (ITU database). I’ve merged two tables from the database into one (absolute numbers and percentages) and cleaned the data up a bit. The resulting tab-separated CSV file is available here.
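
In case you want to reproduce the merge, it’s essentially a one-liner for R’s merge() function. A rough sketch, with hypothetical file names for the two ITU tables (the actual exports from data.un.org will be named differently):

abs <- read.csv("itu_users_absolute.csv", sep="\t")   # absolute user numbers (hypothetical file name)
pct <- read.csv("itu_users_percent.csv", sep="\t")    # users as % of population (hypothetical file name)
n   <- merge(abs, pct, by=c("Country", "Year"))        # join the two tables on country and year
write.table(n, "netstats.csv", sep="\t", row.names=FALSE, quote=FALSE)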

And here’s the R code for rendering the chart. Basically you just replace gvisMotionChart() with gvisGeoChart() for the second chart; the rest stays the same.

library("googleVis")
n <- read.csv("netstats.csv", sep="\t")
nmotion <- gvisMotionChart(n, idvar="Country", timevar="Year", options=list(width=1024, height=768))
plot(nmotion)
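
For the geochart, a minimal sketch along those lines, assuming you first filter the data down to 2009 and that the user-count column is called something like "InternetUsers" (check the actual column names in the CSV):

n2009 <- subset(n, Year == 2009)                       # keep only the 2009 snapshot
ngeo <- gvisGeoChart(n2009,
                     locationvar="Country",            # column holding the country names
                     colorvar="InternetUsers",         # hypothetical column name; adjust to yours
                     options=list(width=1024, height=768))
plot(ngeo)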

I meant to post this a month or so ago, when I was conducting my study of casual tweeting, but didn’t get to it. No harm in posting it now, I guess — code doesn’t go bad, fortunately.

Note: this requires Linux/Unix/OSX, Python 2.6 and the tweepy library. It might also work on Windows, but I haven’t checked.
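
If you don’t have tweepy yet, installing it with pip (or easy_install) should do the trick:

pip install tweepy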

1. Fetching a single user’s tweets with twitter_fetch.py

The purpose of the script below is to automatically retrieve all new tweets by one or more users, where “new” means all tweets that have been added since the last round of archiving. If the script is called for the first time for a given user, it will try to retrieve all available tweets for that person. It relies on the tweepy package for Python, which is one of a number of libraries providing access to the Twitter API. In case you’re looking for a library for R, check out twitteR.

import sys
import time
import os
import tweepy
 
# make sure that the directory 'Tweets' exists, this is
# where the tweets will be archived
wdir = 'Tweets'
user = sys.argv[1]
id_file = user + '.last_id'
timeline_file = user + '.timeline'
 
if os.path.exists(wdir + '/' + id_file):
	f = open(wdir + '/' + id_file, 'r')
	since = int(f.read())
	f.close()
	tweets = tweepy.api.user_timeline(user, since_id=since)
else:
	tweets = tweepy.api.user_timeline(user)
 
if len(tweets) > 0:
	last_id = str(tweets[0].id)
	tweets.reverse()
 
	# write tweets to file
	f = open(wdir + '/' + timeline_file, 'a+')
	for tweet in tweets:
		output = str(tweet.created_at) + '\t' + tweet.text.replace('\r', ' ').replace('\n', ' ').encode('utf-8') + '\t' + tweet.source.encode('utf-8') + '\n'
		f.write(output)
		print output
	f.close()
 
	# write last id to file
	f = open(wdir + '/' + id_file, 'w')
	f.write(last_id)
	f.close()
else:
	print 'No new tweets for ' + user

The code is pretty straightforward. I wrote it without really knowing Python beyond the bare essentials, relying heavily on IPython’s code completion. Actual retrieval of tweets happens in a single line:

tweets = tweepy.api.user_timeline(user)
 

The rest of the script is devoted to managing the data and making sure only new tweets are retrieved. This is done via the since_id parameter which is fed the last recorded id that has been saved to the user’s id file in the previous round of archiving. There are more elegant ways of doing this, but any improvements are up to you. ;-)
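
One candidate improvement: user_timeline only returns one page of tweets per call (20 by default), so the very first run won’t necessarily reach all the way back into a user’s history. tweepy’s Cursor helper can page through the timeline for you; a rough sketch, to be adapted to whatever tweepy version you’re running:

import tweepy

user = 'SomeUser'
# Cursor handles the pagination; items(1000) caps how far back we go.
for tweet in tweepy.Cursor(tweepy.api.user_timeline, id=user).items(1000):
	print tweet.created_at, tweet.text.encode('utf-8')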

2. Fetching a bunch of different users’ tweets with twitter_fetch_all.sh

Second comes a very simple bash script. The only thing it does is call twitter_fetch.py once for each user in a list of people you want to track. Again, there are probably other ways of doing this, but I wanted to keep the functions of the two different scripts separate.

#!/bin/bash
# This will run twitter_fetch.py on the twitter_users[] array. Add any number of twitter_users[NUMBER]="USER" lines below
# to archive additional accounts.
 
# --- twitter user list ---
twitter_users[0]="SomeUser"
twitter_users[1]="SomeOtherUser"
twitter_users[2]="YetAnotherUser"
twitter_users[3]="YouGetTheIdea"
 
# --- execute twitter_fetch.py ---
for twitter_user in "${twitter_users[@]}"
do
	echo "Getting tweets for user $twitter_user"
	python twitter_fetch.py "$twitter_user"
	echo "Done."
	echo ""
done

You should place this in the same directory as twitter_fetch.py and modify it to suit your needs.
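
If the hard-coded array gets unwieldy, one of those other ways would be to keep the usernames in a plain text file, one per line, and loop over that instead. A quick sketch (users.txt is a hypothetical file name):

#!/bin/bash
# Read one Twitter username per line from users.txt and fetch each account.
while read -r twitter_user
do
	echo "Getting tweets for user $twitter_user"
	python twitter_fetch.py "$twitter_user"
	echo "Done."
done < users.txt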

3. Automating the whole thing with a cronjob

Finally, here’s a cron directive I used to automate the process and log the result in case any errors occur. Read the linked Wikipedia article if you’re unfamiliar with cron; it’s a very convenient way of automating tasks on Linux/Unix.

0 * * * * sh /root/twitter_fetch_all.sh >/root/twitter_fetch.log
 

(Yes, I’m running this as root. Because I can. And because it’s an EC2 instance with nothing else on it anyway.)
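
One small caveat: as written, the redirect only captures standard output, so actual error messages from the scripts won’t end up in the log. If you want errors in there too, appending 2>&1 takes care of it:

0 * * * * sh /root/twitter_fetch_all.sh >/root/twitter_fetch.log 2>&1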

Hope it’s useful to someone, let me know if you have any questions. :-)


For those interested, here are video and slides for my talk on social data at NEXT ’11 in Berlin last month.

You can also download the slides as a PDF here (CC-BY license).