Dynamic Twitter graphs with R and Gephi (clip and code)

On January 2, 2011, in Code, by cornelius

Note 1/4/11: I’ve updated the code below after discovering a few bugs.

Back in October when Jean Burgess first posted a teaser of the Gephi dynamic graph feature applied to Twitter data, I thought right away that this was going to bring Twitter visualization to an entirely new level. When you play around with graph visualizations for a while you inevitably come to the conclusion that they are of very limited use for studying something like Twitter because of its dynamicity as an ongoing communicative process. Knowing that someone retweeted someone else a lot or that a certain word occurred many times is only half the story. *When* someone got a lot of retweets or some word was used frequently is often much more interesting.

Anyhow, Axel Bruns posted a first bit of code (for generating GEXF files) back in October, followed by a detailed implementation (1, 2) a few days ago. Since Axel uses Gawk and I prefer R, the first thing I did was to write an R port of Axel’s code. It does the following:

  1. Extract all tweets containing @-messages and retweets from a Twapperkeeper hashtag archive.
  2. Generate a table containing the fields sender, recipient, start time and end time for each data point.
  3. Write this table to a GEXF file.
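
To make steps 1 and 2 concrete, here is a minimal sketch on a tiny in-memory data frame. The toy tweets are invented for illustration; the column names (from_user, text, time) are the ones the full script below expects. Note one simplification: this sketch takes the first @-mention in each tweet, whereas the greedy pattern in the full script picks up the last one.

```r
# Toy stand-in for a Twapperkeeper archive (invented data, real column names).
tweets <- data.frame(
  from_user = c("Alice", "bob", "carol"),
  text = c("@Bob: nice talk!",
           "just arrived at the venue",
           "RT @alice: slides are online"),
  time = c(1000, 1060, 1200),
  stringsAsFactors = FALSE
)

decaytime <- 3600  # seconds an edge stays visible after the tweet

# Step 1: keep only tweets that mention another user (@-messages and retweets).
pattern <- "@([a-z0-9_]{1,15})"
ats <- tweets[grepl(pattern, tweets$text, ignore.case = TRUE), ]

# Step 2: one row per data point with sender, recipient, start and end time.
# regmatches() returns the first match per tweet, e.g. "@Bob".
m <- regmatches(ats$text, regexpr(pattern, ats$text, ignore.case = TRUE))
edges <- data.frame(
  from = tolower(ats$from_user),
  to = tolower(sub("^@", "", m)),
  start = ats$time - min(ats$time),
  end = ats$time - min(ats$time) + decaytime,
  stringsAsFactors = FALSE
)
print(edges)
```

The second toy tweet contains no mention and is dropped, so the table ends up with two rows, one per mention.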

The implementation as such wasn’t difficult and I didn’t really follow Axel’s code too closely, since R is syntactically different from Gawk. The thing I needed to figure out was the logic of the GEXF file, specifically start and end times, in order to make sure that edges decay over time. Axel explains this in detail in his post and provides a very thorough and clean implementation.
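
As a rough illustration of the start/end logic, the sketch below works through a single sender/recipient pair. The timestamps and user names are hypothetical, and only the edge and slice elements are shown (the attvalues part that the full script also writes is left out). Each tweet contributes one slice that begins at the tweet's offset from the first tweet in the archive and ends decaytime seconds later; the edge as a whole runs from the earliest start to the latest end.

```r
decaytime <- 3600    # same default as in the full script
t0 <- 1262304000     # hypothetical timestamp of the first tweet in the archive
# Hypothetical: three tweets from one sender to one recipient.
tweet.times <- c(1262304000, 1262305800, 1262314800)

starts <- tweet.times - t0   # 0, 1800, 10800
ends <- starts + decaytime   # 3600, 5400, 14400

# The edge spans from its first appearance to its last disappearance ...
edge.open <- sprintf("<edge id=\"1\" source=\"alice\" target=\"bob\" start=\"%.0f\" end=\"%.0f\">",
                     min(starts), max(ends))

# ... and each individual tweet becomes one slice within that span.
slices <- sprintf("\t<slice start=\"%.0f\" end=\"%.0f\" />", starts, ends)

xml <- paste(c(edge.open, "<slices>", slices, "</slices>", "</edge>"),
             collapse = "\n")
cat(xml, "\n")
```

In this example the first two slices overlap (0 to 3600 and 1800 to 5400), which is exactly the case the `# overlap` comment in the full script leaves open.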

My own implementation is rougher and probably still needs polishing in several places, but here’s a first result (no sound; watch in HD and fullscreen):

Note 1/7/11: I’ve replaced the clip above with a better one after ironing out a few issues with my script. The older clip is still available here.

Like previous visualizations I’ve done, this also uses the #MLA09 data, i.e. tweets from the 2009 convention of the Modern Language Association.
1/7/11: The newer clip is based on data from Digital Humanities 2010 (#dh2010).

And here’s the R code for generating the GEXF file, in case you want to play around with it:

rm(list=ls(all=T));
 
outfile.gexf <- "dh2010.gexf";
decaytime = 3600;
buffer = 0;
eid = 1;
 
tweets <- read.csv(file.choose(), head=T, sep="|", quote="", fileEncoding="UTF-8");
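# match user names case-insensitively, otherwise mentions like @JohnDoe are missed
ats <- tweets[grep("@([a-z0-9_]{1,15}):?", tweets$text, ignore.case=T),];
g.from <- tolower(as.character(ats$from_user));
g.to <- tolower(gsub("^.*@([a-z0-9_]{1,15}):?.*$", "\\1", ats$text, ignore.case=T, perl=T));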
g.start <- ats$time - min(ats$time) + buffer;
g.end <- ats$time - min(ats$time) + decaytime + buffer;
g <- data.frame(from=g.from[], to=g.to[], start=g.start[], end=g.end[]);
g <- g[order(g$from, g$to, g$start),];
output <- paste("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<gexf xmlns=\"http://www.gexf.net/1.2draft\" version=\"1.2\">\n<graph mode=\"dynamic\" defaultedgetype=\"directed\" start=\"0\" end=\"", max(g$end) + decaytime, "\">\n<edges>\n", sep ="");
all.from <- as.character(unique(g$from));
for (i in 1:length(all.from))
{
	this.from <- all.from[i];
	# exact comparison instead of grep(), which would also match partial names (e.g. "ann" inside "joanna")
	this.to <- as.character(unique(g$to[g$from == this.from]));
	for (j in 1:length(this.to))
	{
		all.starts <- g$start[g$from == this.from & g$to == this.to[j]];
		all.ends <- g$end[g$from == this.from & g$to == this.to[j]];
		output <- paste(output, "<edge id=\"", eid, "\" source=\"", this.from, "\" target=\"", this.to[j], "\" start=\"", min(all.starts), "\" end=\"", max(all.ends), "\">\n<attvalues>\n", sep="");
		for (k in 1:length(all.starts))
		{	
			# overlap
			# if (all.starts[k+1] < all.ends[k]) output <- paste(output, "", sep=""); ... ?
			output <- paste(output, "\t<attvalue for=\"0\" value=\"1\" start=\"", all.starts[k], "\" />\n", sep="");
		}
		output <- paste(output, "</attvalues>\n<slices>\n", sep="");		
		for (l in 1:length(all.starts))
		{
			output <- paste(output, "\t<slice start=\"", all.starts[l], "\" end=\"", all.ends[l], "\" />\n", sep="");
		}
		output <- paste(output, "</slices>\n</edge>\n", sep="");	
		eid = eid + 1;
	}
} 
output <- paste(output, "</edges>\n</graph>\n</gexf>\n", sep = "");
cat(output, file=outfile.gexf);
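
The `# overlap` placeholder in the loop above is left open. One possible way to approach it, sketched here under assumptions and not part of the script, is to (a) merge overlapping slices for a pair so the edge is simply shown for the combined period, and (b) count how many mentions are active in each period, which could serve as a time-varying weight. The function names merge.slices and active.weights are made up for this illustration, and how such weights are best expressed in the GEXF file is not addressed here.

```r
# Merge overlapping [start, end] intervals for a single sender/recipient pair.
merge.slices <- function(starts, ends) {
  ord <- order(starts)
  starts <- starts[ord]
  ends <- ends[ord]
  m.start <- starts[1]
  m.end <- ends[1]
  for (k in seq_along(starts)[-1]) {
    n <- length(m.start)
    if (starts[k] <= m.end[n]) {
      # Overlaps (or touches) the current merged slice: extend it.
      m.end[n] <- max(m.end[n], ends[k])
    } else {
      # Gap: open a new slice.
      m.start <- c(m.start, starts[k])
      m.end <- c(m.end, ends[k])
    }
  }
  data.frame(start = m.start, end = m.end)
}

# Count how many mentions are active between consecutive start/end events.
active.weights <- function(starts, ends) {
  breaks <- sort(unique(c(starts, ends)))
  from <- head(breaks, -1)
  to <- tail(breaks, -1)
  weight <- sapply(from, function(t) sum(starts <= t & ends > t))
  # Drop periods in which no mention is active.
  data.frame(start = from, end = to, weight = weight)[weight > 0, ]
}

# Example: three mentions, the first two overlapping.
ex.starts <- c(0, 1800, 10800)
ex.ends <- c(3600, 5400, 14400)
merged <- merge.slices(ex.starts, ex.ends)
weights <- active.weights(ex.starts, ex.ends)
print(merged)
print(weights)
```

For the example input, the first two mentions collapse into one slice from 0 to 5400, and the weight is 2 only for the period from 1800 to 3600, when both are active.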

7 Responses to Dynamic Twitter graphs with R and Gephi (clip and code)

  1. Axel Bruns says:

    Hi Cornelius,

    very interesting… One thing I’m not quite clear about, though (I’m not very familiar with R syntax): does your script deal with variable edge weights – does the weight of an edge increase if multiple @replies from one user to another are active at the same time?

    Axel

  2. cornelius says:

    Hi Axel,

    thanks for stopping by and for pointing out a bug in my code. :-)

    The initial implementation I posted was flawed regarding the period for which edges are shown. Since Gephi treats edge weights cumulatively (i.e. redeclaring the same relation will stack weight), I thought I might take the easy route and simply redeclare the same relationship multiple times, letting Gephi figure out the respective edge weight and whether or not a particular edge is shown at all. This seemed to work at first: edges will show at their designated start and end times, but only for the first set of values given. In my buggy implementation, edges appear only a single time and get the cumulative weight of *all* their occurrences for that period. :-(

    I’ve spent most of today fixing this and edges now appear and reappear correctly. I have not dealt with edge overlaps yet. It’s a shame slice weights don’t stack; if they did, things would be a bit simpler. It’s also annoying how node ids are handled (I mean specifically how labels and ids are treated). Defining all nodes and their respective start and end times would make them appear and disappear as they enter and leave conversations, which would make the graph more readable.

    Anyhow, enough hacking for today — hopefully I’ll be able to take care of the edge overlap weight thing tomorrow…

  3. [...] article was mentioned in Dynamic Twitter graphs with R and Gephi (clip and code) as an interesting example of “aging” [...]

  4. [...] this older post for more information on how to visualize dynamic graphs of retweets with [...]

  5. [...] Next on the to do list is: – automate the production of archive reports – work in the time component so we can view behaviour over time in Gephi… (here’s a starting point maybe, again from Cornelius Puschmann’s blog: Dynamic Twitter graphs with R and Gephi (clip and code)) [...]

  6. Plotti says:

    Hi Cornelius,

    Great idea! I’ve included the dynamic graph export for Gephi in my Twitter tool too. If you are into Ruby you might like it: twitterresearcher.wordpress.com

  7. [...] If you’re making Gephi graphs out of tweets, you’re probably doing more data science marketing than data science analytics. [...]
