Dynamic Twitter graphs with R and Gephi (clip and code)

On January 2, 2011, in Code, by cornelius

Note 1/4/11: I’ve updated the code below after discovering a few bugs.

Back in October, when Jean Burgess first posted a teaser of the Gephi dynamic graph feature applied to Twitter data, I thought right away that this was going to bring Twitter visualization to an entirely new level. When you play around with graph visualizations for a while, you inevitably come to the conclusion that they are of very limited use for studying something like Twitter, because of its dynamic nature as an ongoing communicative process. Knowing that someone retweeted someone else a lot, or that a certain word occurred many times, is only half the story. When someone got a lot of retweets, or when some word was used frequently, is often much more interesting.

Anyhow, Axel Bruns posted a first bit of code (for generating GEXF files) back in October, followed by a detailed implementation (1, 2) a few days ago. Since Axel uses Gawk and I prefer R, the first thing I did was to write an R port of Axel’s code. It does the following:

  1. Extract all tweets containing @-messages and retweets from a Twapperkeeper hashtag archive.
  2. Generate a table containing the fields sender, recipient, start time and end time for each data point.
  3. Write this table to a GEXF file.
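To make step 2 concrete, here is a sketch of what the intermediate table looks like. The usernames and timestamps are invented for illustration; times are in seconds relative to the first tweet in the archive, and with a decay time of 3600 seconds each edge lives for one hour after the tweet that created it:

```r
# Hypothetical rows of the sender/recipient table (values invented).
# start = tweet time minus the archive's first timestamp;
# end   = start + decaytime (3600 s), when the edge fades out.
g <- data.frame(
  from  = c("alice", "alice", "bob"),
  to    = c("bob",   "carol", "alice"),
  start = c(0,       120,     305),
  end   = c(3600,    3720,    3905)
)
```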

The implementation as such wasn’t difficult and I didn’t really follow Axel’s code too closely, since R is syntactically different from Gawk. The thing I needed to figure out was the logic of the GEXF file, specifically start and end times, in order to make sure that edges decay over time. Axel explains this in detail in his post and provides a very thorough and clean implementation.
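For reference, a single dynamic edge in the generated GEXF file looks roughly like the fragment below (usernames and times are invented for illustration). The edge's `start` and `end` span all contacts between the pair, each `<attvalue>` marks one contact, and each `<slice>` keeps the edge alive for the decay interval after a contact:

```xml
<edge id="1" source="alice" target="bob" start="0" end="3905">
<attvalues>
	<attvalue for="0" value="1" start="0" />
	<attvalue for="0" value="1" start="305" />
</attvalues>
<slices>
	<slice start="0" end="3600" />
	<slice start="305" end="3905" />
</slices>
</edge>
```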

My own implementation is rougher and probably still needs polishing in several places, but here’s a first result (no sound; watch in HD and fullscreen):

Note 1/7/11: I’ve replaced the clip above with a better one after ironing out a few issues with my script. The older clip is still available here.

Like previous visualizations I’ve done, this also uses the #MLA09 data, i.e. tweets from the 2009 convention of the Modern Language Association.
1/7/11: The newer clip is based on data from Digital Humanities 2010 (#dh2010).

And here’s the R code for generating the GEXF file, in case you want to play around with it:

rm(list=ls(all=T));
 
outfile.gexf <- "dh2010.gexf";
decaytime <- 3600;  # how long an edge stays visible, in seconds
buffer <- 0;        # optional offset added to all timestamps
eid <- 1;           # running edge id
 
# Read a Twapperkeeper hashtag archive (pipe-separated)
tweets <- read.csv(file.choose(), header=T, sep="|", quote="", fileEncoding="UTF-8");
# Keep only tweets that mention another user (@-messages and retweets)
ats <- tweets[grep("@([a-z0-9_]{1,15}):?", tweets$text, ignore.case=T),];
g.from <- tolower(as.character(ats$from_user));
g.to <- tolower(gsub("^.*@([a-z0-9_]{1,15}):?.*$", "\\1", ats$text, ignore.case=T, perl=T));
# Start and end times relative to the first tweet in the archive
g.start <- ats$time - min(ats$time) + buffer;
g.end <- ats$time - min(ats$time) + decaytime + buffer;
g <- data.frame(from=g.from, to=g.to, start=g.start, end=g.end);
g <- g[order(g$from, g$to, g$start),];
output <- paste("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<gexf xmlns=\"http://www.gexf.net/1.2draft\" version=\"1.2\">\n<graph mode=\"dynamic\" defaultedgetype=\"directed\" start=\"0\" end=\"", max(g$end) + decaytime, "\">\n<edges>\n", sep ="");
all.from <- as.character(unique(g$from));
for (i in 1:length(all.from))
{
	this.from <- all.from[i];
	this.to <- as.character(unique(g$to[g$from == this.from]));
	for (j in 1:length(this.to))
	{
		# Exact matching here -- grep() would also match usernames that
		# merely contain this.from or this.to[j] as a substring
		rows <- which(g$from == this.from & g$to == this.to[j]);
		all.starts <- g$start[rows];
		all.ends <- g$end[rows];
		output <- paste(output, "<edge id=\"", eid, "\" source=\"", this.from, "\" target=\"", this.to[j], "\" start=\"", min(all.starts), "\" end=\"", max(all.ends), "\">\n<attvalues>\n", sep="");
		# One attvalue per contact between this pair
		for (k in 1:length(all.starts))
		{	
			output <- paste(output, "\t<attvalue for=\"0\" value=\"1\" start=\"", all.starts[k], "\" />\n", sep="");
		}
		output <- paste(output, "</attvalues>\n<slices>\n", sep="");		
		# One slice per contact; overlapping slices keep the edge alive
		# until decaytime has passed after the last contact
		for (l in 1:length(all.starts))
		{
			output <- paste(output, "\t<slice start=\"", all.starts[l], "\" end=\"", all.ends[l], "\" />\n", sep="");
		}
		output <- paste(output, "</slices>\n</edge>\n", sep="");	
		eid <- eid + 1;
	}
} 
output <- paste(output, "</edges>\n</graph>\n</gexf>\n", sep = "");
cat(output, file=outfile.gexf);