Dynamic Twitter graphs with R and Gephi (clip and code)

On January 2, 2011, in Code, by cornelius

Note 1/4/11: I’ve updated the code below after discovering a few bugs.

Back in October when Jean Burgess first posted a teaser of the Gephi dynamic graph feature applied to Twitter data, I thought right away that this was going to bring Twitter visualization to an entirely new level. When you play around with graph visualizations for a while you inevitably come to the conclusion that they are of very limited use for studying something like Twitter because of its dynamicity as an ongoing communicative process. Knowing that someone retweeted someone else a lot or that a certain word occurred many times is only half the story. When someone got a lot of retweets or when some word was used frequently is often much more interesting.

Anyhow, Axel Bruns posted a first bit of code (for generating GEXF files) back in October, followed by a detailed implementation (1, 2) a few days ago. Since Axel uses Gawk and I prefer R, the first thing I did was to write an R port of Axel’s code. It does the following:

  1. Extract all tweets containing @-messages and retweets from a Twapperkeeper hashtag archive.
  2. Generate a table containing the fields sender, recipient, start time and end time for each data point.
  3. Write this table to a GEXF file.

The implementation as such wasn’t difficult and I didn’t really follow Axel’s code too closely, since R is syntactically different from Gawk. The thing I needed to figure out was the logic of the GEXF file, specifically start and end times, in order to make sure that edges decay over time. Axel explains this in detail in his post and provides a very thorough and clean implementation.
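To make the timing logic concrete, here is a minimal sketch with made-up timestamps (the values and the variable name `times` are assumptions for illustration, not taken from any archive). Each tweet opens an edge at its offset from the first tweet and closes it `decaytime` seconds later, which is the same arithmetic the script below applies to the whole table.

```r
# Hypothetical Unix timestamps of three tweets between the same two users
# (illustrative values only)
times <- c(1262304000, 1262304600, 1262311200)
decaytime <- 3600  # seconds an edge stays visible after a tweet

# Offsets relative to the first tweet, so the timeline starts at 0
start <- times - min(times)
end <- start + decaytime

slices <- data.frame(start = start, end = end)
print(slices)
```

In this toy example the first two slices overlap (the second starts at 600, before the first ends at 3600), so the edge stays visible from 0 to 4200, disappears, and reappears at 7200.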

My own implementation is rougher and probably still needs polishing in several places, but here’s a first result (no sound; watch in HD and fullscreen):

Note 1/7/11: I’ve replaced the clip above with a better one after ironing out a few issues with my script. The older clip is still available here.

Like previous visualizations I’ve done, this also uses the #MLA09 data, i.e. tweets from the 2009 convention of the Modern Language Association.
1/7/11: The newer clip is based on data from Digital Humanities 2010 (#dh2010).

And here’s the R code for generating the GEXF file, in case you want to play around with it:

rm(list=ls(all=T));
 
outfile.gexf <- "dh2010.gexf";
decaytime = 3600;
buffer = 0;
eid = 1;
 
tweets <- read.csv(file.choose(), head=T, sep="|", quote="", fileEncoding="UTF-8");
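# Match user names regardless of case (e.g. @Alice); because of the greedy
# ".*" the last mention in a tweet is the one that ends up as recipient
ats <- tweets[grep("@([a-z0-9_]{1,15}):?", tweets$text, ignore.case=T),];
g.from <- tolower(as.character(ats$from_user))
g.to <- tolower(gsub("^.*@([a-z0-9_]{1,15}):?.*$", "\\1", ats$text, perl=T, ignore.case=T));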
g.start <- ats$time - min(ats$time) + buffer;
g.end <- ats$time - min(ats$time) + decaytime + buffer;
g <- data.frame(from=g.from[], to=g.to[], start=g.start[], end=g.end[]);
g <- g[order(g$from, g$to, g$start),];
output <- paste("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<gexf xmlns=\"http://www.gexf.net/1.2draft\" version=\"1.2\">\n<graph mode=\"dynamic\" defaultedgetype=\"directed\" start=\"0\" end=\"", max(g$end) + decaytime, "\">\n<edges>\n", sep ="");
all.from <- as.character(unique(g$from));
for (i in 1:length(all.from))
{
	this.from <- all.from[i];
	# exact comparison; grep() would also match user names that merely contain this.from
	this.to <- as.character(unique(g$to[g$from == this.from]));
	for (j in 1:length(this.to))
	{
		# exact comparison on both ends of the edge (grep() matches substrings too)
		all.starts <- g$start[g$from == this.from & g$to == this.to[j]];
		all.ends <- g$end[g$from == this.from & g$to == this.to[j]];
		output <- paste(output, "<edge id=\"", eid, "\" source=\"", this.from, "\" target=\"", this.to[j], "\" start=\"", min(all.starts), "\" end=\"", max(all.ends), "\">\n<attvalues>\n", sep="");
		for (k in 1:length(all.starts))
		{	
			# overlap
			# if (all.starts[k+1] < all.ends[k]) output <- paste(output, "", sep=""); ... ?
			output <- paste(output, "\t<attvalue for=\"0\" value=\"1\" start=\"", all.starts[k], "\" />\n", sep="");
		}
		output <- paste(output, "</attvalues>\n<slices>\n", sep="");		
		for (l in 1:length(all.starts))
		{
			output <- paste(output, "\t<slice start=\"", all.starts[l], "\" end=\"", all.ends[l], "\" />\n", sep="");
		}
		output <- paste(output, "</slices>\n</edge>\n", sep="");	
		eid = eid + 1;
	}
} 
output <- paste(output, "</edges>\n</graph>\n</gexf>\n", sep = "");
cat(output, file=outfile.gexf);
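
The commented-out overlap check in the innermost loop is left open in the script. One possible way to deal with it, offered as a sketch rather than as what the script does, is to merge overlapping slices before writing them out. The helper name `merge_slices` is hypothetical, and whether Gephi actually requires non-overlapping slices is not established here.

```r
# Merge overlapping [start, end] intervals into non-overlapping slices.
# merge_slices is a hypothetical helper, not part of the original script.
merge_slices <- function(starts, ends) {
  if (length(starts) == 0) {
    return(data.frame(start = numeric(0), end = numeric(0)))
  }
  ord <- order(starts)
  starts <- starts[ord]
  ends <- ends[ord]
  out.start <- starts[1]
  out.end <- ends[1]
  for (k in seq_along(starts)[-1]) {
    n <- length(out.end)
    if (starts[k] <= out.end[n]) {
      # Overlaps (or touches) the current slice: extend it
      out.end[n] <- max(out.end[n], ends[k])
    } else {
      # Gap between slices: open a new one
      out.start <- c(out.start, starts[k])
      out.end <- c(out.end, ends[k])
    }
  }
  data.frame(start = out.start, end = out.end)
}

merged <- merge_slices(c(0, 600, 7200), c(3600, 4200, 10800))
print(merged)
```

With these made-up values the first two slices collapse into one running from 0 to 4200, and the third stays separate.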

A new, simpler approach to Twitter visualization

On November 27, 2010, in Uncategorized, by cornelius

If you’ve been following my work recently you might have noticed my slight obsession with visualization, especially in relation to communication on Twitter. I’ve been experimenting both with graphs and with traditional bar and pie charts to show what happens when people use Twitter.

Now I’ve tried something new, somewhat inspired by an essay on info visualization recently published by Lev Manovich. In it, Manovich describes an approach that he calls direct visualization:

In direct visualization, the data is reorganized into a new visual representation that preserves its original form. Usually, this does involve some data transformation such as changing data size. For instance, text cloud reduces the size of text to a small number of most frequently used words. However, this is a reduction that is quantitative rather than qualitative. We don’t substitute media objects by new objects (i.e. graphical primitives typically used in infovis), which only communicate selected properties of these objects (for instance, bars of different lengths representing word frequencies). My phrase “visualization without reduction” refers to this preservation of a much richer set of properties of data objects when we create visualizations directly from them.

Applying this idea to the Twitter data I work with, I decided to try something new. Instead of reducing the richness of the data, why not rearrange it to make it more readable? And here’s the result of my attempt to do that:

All tweets using the #MLA09 hashtags in one large PDF

(Note: download the PDF and look at it in your favorite PDF viewer if the zooming in Scribd is sluggish)


After recently discovering the excellent methods section on mappingonlinepublics.net, I decided it was time to document my own approach to Twitter data. I’ve been messing around with R and igraph for a while, but it wasn’t until I discovered Gephi that things really moved forward. R/igraph are great for preprocessing the data (not sure how they compare with Awk), but rather cumbersome to work with when it comes to visualization. Last week, I posted a first Gephi visualization of retweeting at the Free Culture Research Conference and since then I’ve experimented some more (see here and here). #FCRC was a test case for a larger study that examines how academics use Twitter at conferences, which is part of what we’re doing at the junior researchers group Science and the Internet at the University of Düsseldorf (sorry, website is currently in German only).

Here’s a step-by-step description of how those graphs were created.

Step #1: Get tweets from Twapperkeeper
Like Axel, I use Twapperkeeper to retrieve tweets tagged with the hashtag I’m investigating. This has several advantages:

  • it’s possible to retrieve older tweets which you won’t get via the API
  • tweets are stored as CSV rather than XML which makes them easier to work with for our purposes.

The sole disadvantage of Twapperkeeper is that we have to rely on the integrity of their archive: if for some reason not all tweets with our hashtag have been retrieved, we won’t know. Also, certain information that is present in Twitter’s XML and that we might be interested in (e.g. geolocation) is not retained in Twapperkeeper’s CSV files.

Instructions:

  1. Search for the hashtag you’re interested in (e.g. #FCRC). If no archive exists, create one.
  2. Go to the archive’s Twapperkeeper page, sign into Twitter (button at the top) and then choose export and download at the bottom of the page
  3. Choose the pipe character (“|”) as separator. I use that one rather than the more conventional comma or semicolon because we are dealing with text data which is bound to contain these characters a lot. Of course the pipe can also be parsed incorrectly, so be sure to have a look at the graph file you make.
  4. Voila. You should now have a CSV file containing tweets on your hard drive. Edit: Actually, you have a .tar file that contains the tweets. Look inside the .tar for a file with a very long name ending with “-1” (not “info”); that’s the data we’re looking for.
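
Since a stray pipe inside a tweet can shift the columns, it may be worth checking the file before building a graph from it. The following is a minimal sketch on a made-up three-line sample (the sample rows and the variable names are assumptions; the column names follow the ones used in the code in these posts). It relies on base R's count.fields() to find lines that split into an unexpected number of fields.

```r
# Write a tiny pipe-separated sample to a temporary file (illustrative data)
sample.file <- tempfile(fileext = ".csv")
writeLines(c(
  "text|from_user|time",
  "RT @alice: great talk|bob|1262304000",
  "@bob thanks, see you | later|alice|1262304600"
), sample.file)

# count.fields() reports how many fields each line splits into;
# comment.char = "" keeps hashtags from being treated as comments
n.fields <- count.fields(sample.file, sep = "|", quote = "", comment.char = "")
expected <- n.fields[1]  # the header line defines the expected width
bad.lines <- which(n.fields != expected)
print(bad.lines)
```

Here the third line contains an extra pipe inside the tweet text and is flagged. On a real export, the flagged line numbers point to the tweets that need a manual look.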

Step #2: Turn CSV data into a graph file with R and igraph
R is an open source statistics package that is primarily used via the command line. It’s absolutely fantastic at slicing and dicing data, although the syntax is a bit quirky and the documentation is somewhat geared towards experts (=statisticians). igraph is an R package for constructing and visualizing graphs. It’s great for a variety of purposes, but due to the command line approach of R, actually drawing graphs with igraph was somewhat difficult for me. But, as outlined below, Gephi took care of that. Running the code below in R will transform the CSV data into a GraphML file which can then be visualized with Gephi. While R and igraph rock at translating the data into another format, Gephi is the better tool for the actual visualization.

Instructions:

  1. Download and install R.
  2. In the R console, run the following: install.packages("igraph");
  3. Copy the CSV you’ve just downloaded from Twapperkeeper to an empty directory and rename it to tweets.csv.
  4. Finally, save the R file below to the same folder as the CSV and run it.

Code for extracting RTs and @s from a Twapperkeeper CSV file and saving the result in the GraphML format:

# Extract @-message and RT graphs from conference tweets
library(igraph);
 
# Read Twapperkeeper CSV file
tweets <- read.csv("tweets.csv", head=T, sep="|", quote="", fileEncoding="UTF-8");
print(paste("Read ", length(tweets$text), " tweets.", sep=""));
 
# Get @-messages, senders, receivers
ats <- grep("^\\.?@[a-z0-9_]{1,15}", tolower(tweets$text), perl=T, value=T);
at.sender <- tolower(as.character(tweets$from_user[grep("^\\.?@[a-z0-9_]{1,15}", tolower(tweets$text), perl=T)]));
at.receiver <- gsub("^\\.?@([a-z0-9_]{1,15})[^a-z0-9_]+.*$", "\\1", ats, perl=T);
print(paste(length(ats), " @-messages from ", length(unique(at.sender)), " senders and ", length(unique(at.receiver)), " receivers.", sep=""));
 
# Get RTs, senders, receivers
rts <- grep("^rt @[a-z0-9_]{1,15}", tolower(tweets$text), perl=T, value=T);
rt.sender <- tolower(as.character(tweets$from_user[grep("^rt @[a-z0-9_]{1,15}", tolower(tweets$text), perl=T)]));
rt.receiver <- gsub("^rt @([a-z0-9_]{1,15})[^a-z0-9_]+.*$", "\\1", rts, perl=T);
print(paste(length(rts), " RTs from ", length(unique(rt.sender)), " senders and ", length(unique(rt.receiver)), " receivers.", sep=""));
 
# This is necessary to avoid problems with empty entries, usually caused by encoding issues in the source files
at.sender[at.sender==""] <- "<NA>";
at.receiver[at.receiver==""] <- "<NA>";
rt.sender[rt.sender==""] <- "<NA>";
rt.receiver[rt.receiver==""] <- "<NA>";
 
# Create a data frame from the sender-receiver information
ats.df <- data.frame(at.sender, at.receiver);
rts.df <- data.frame(rt.sender, rt.receiver);
 
# Transform data frame into a graph
ats.g <- graph.data.frame(ats.df, directed=T);
rts.g <- graph.data.frame(rts.df, directed=T);
 
# Write sender -> receiver information to a GraphML file
print("Write sender -> receiver table to GraphML file...");
write.graph(ats.g, file="ats.graphml", format="graphml");
write.graph(rts.g, file="rts.graphml", format="graphml");
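
Before opening the GraphML files in Gephi, a quick look at the sender and receiver vectors can serve as a sanity check. This sketch uses made-up vectors standing in for `rt.sender` and `rt.receiver` and only base R; the `weight` column is an assumption added for illustration and is not written by the script above.

```r
# Made-up stand-ins for the rt.sender / rt.receiver vectors built above
rt.sender   <- c("bob", "carol", "dave", "bob", "carol")
rt.receiver <- c("alice", "alice", "bob", "alice", "bob")

# How often was each user retweeted (number of incoming edges)?
retweeted <- sort(table(rt.receiver), decreasing = TRUE)
print(retweeted)

# Collapse repeated sender -> receiver pairs into one weighted edge each
edges <- aggregate(weight ~ rt.sender + rt.receiver,
                   data = data.frame(rt.sender, rt.receiver, weight = 1),
                   FUN = sum)
print(edges)
```

The first table gives a rough "most retweeted users" ranking, and the weighted edge list shows how many parallel edges the graph will contain between the same two users.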

Step #3: Visualize graph with Gephi
Once you’ve completed steps 1 and 2, simply open your GraphML file(s) with Gephi. You should see a visualization of the graph. I won’t give an in-depth description of how Gephi works, but the users section of gephi.org has great tutorials which explain both Gephi and graph visualization in general really well.

I’ll post more on the topic as I make further progress, for example with stuff like dynamic graphs which show change in the network over time.


Post-event Twitter stats for #THATcamp

On May 26, 2010, in data, by cornelius

I thought I’d post an updated version of the simple stats on Twitter activity presented here. The data in the older post was collected before THATcamp took place; the graphs below show the activity during and after the camp.

The tweets I’ve collected are also available here (my own file) and on TwapperKeeper.

Tweets over time (roughly 14th of May to 24th)

Most active users

Most @-messaged users

Most retweeted users


URLs tweeted at #THATCamp (all 230 of them)

On May 24, 2010, in data, by cornelius

I’ve data-mined the #thatcamp hashtag a bit more and extracted all 230 links that were tweeted recently (also includes some of THATCamp Paris). Enjoy :-)

(or go here to view the table inside Google Docs)


Edit: I’ve posted an updated version of the script here. It is not quite as compressed as Anatol’s version, but I think it’s a decent compromise between readability and efficiency. :-)

Edit #2 And yet another update, this one contributed by Kai Heinrich.

I hacked together some code for R last night to visualize a Twitter graph (=who you are following and who is following you) that I briefly showed at the session on visualizing text today at THATCamp and that I wanted to share. My comments in the code are very basic and there is much to improve, but in the spirit of “release early, release often”, I think it’s better to get it out there right away.

Ingredients:

Note that packages are most easily installed with the install.packages() function inside of R, so R is really the only thing you need to download initially.

Code:

# Load twitteR package
library(twitteR)

# Load igraph package
library(igraph)


# Set up friends and followers as vectors. This, along with some stuff below, is not really necessary, but the result of my relative inability to deal with the twitter user object in an elegant way. I'm hopeful that I will figure out a way of shortening this in the future

friends <- as.character()
followers <- as.character()

# Start a Twitter session. Note that the user through whom the session is started doesn't have to be the one that you search for in the next step. I'm using myself (coffee001) in the code below, but you could authenticate with your username and then search for somebody else.

sess <- initSession('coffee001', 'mypassword')

# Retrieve a maximum of 500 friends for user 'coffee001'.

friends.object <- userFriends('coffee001', n=500, sess)

# Retrieve a maximum of 500 followers for 'coffee001'. Note that retrieving many/all of your followers will create a very busy graph, so if you are experimenting it's better to start with a small number of people (I used 25 for the graph below).

followers.object <- userFollowers('coffee001', n=500, sess)

# This code is necessary at the moment, but only because I don't know how to slice just the "name" field for friends and followers from the list of user objects that twitteR retrieves. I am 100% sure there is an alternative to looping over the objects, I just haven't found it yet. Let me know if you do...

for (i in 1:length(friends.object))
{
friends <- c(friends, friends.object[[i]]@name);
}


for (i in 1:length(followers.object))
{
followers <- c(followers, followers.object[[i]]@name);
}


# Create data frames that relate friends and followers to the user you search for and merge them.

relations.1 <- data.frame(User='Cornelius', Follower=friends)
relations.2 <- data.frame(User=followers, Follower='Cornelius')
relations <- merge(relations.1, relations.2, all=T)

# Create graph from relations.

g <- graph.data.frame(relations, directed = T)

# Assign labels to the graph (=people's names)

V(g)$label <- V(g)$name

# Plot the graph.

plot(g)

For the screenshot below I've used the tkplot() method instead of plot(), which allows you to move around and highlight elements interactively with the mouse after plotting them. The graph only shows 20 people in order to keep the complexity manageable.
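
As for the loops over the friend and follower objects: one possible alternative, sketched here under assumptions, is to apply a small function to every element of the list. The `MockUser` class below is a hypothetical stand-in for the user objects; all it shares with them is an S4 `name` slot, which is what the loop above accesses via `@name`.

```r
library(methods)

# Hypothetical stand-in for the user objects: an S4 class with a name slot
setClass("MockUser", representation(name = "character"))
friends.object <- list(new("MockUser", name = "alice"),
                       new("MockUser", name = "bob"))

# Pull the name slot out of every object without an explicit for loop;
# vapply() also returns a character vector when the list is empty
friends <- vapply(friends.object, function(u) u@name, character(1))
print(friends)
```

Unlike growing a vector with c() inside a loop, this does not need the empty `friends` and `followers` vectors to be set up beforehand.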


One week of Scientwist tweeting (January 18 – 25)

On January 27, 2010, in data, by cornelius

Here’s a list of URLs and hashtags that were popular among the @scientwists community last week. I realize that this is just a long enumeration, but I’m planning to publish these stats in a more concise format in the near future.

January 18th
http://phylogenomics.blogspot.com/2010/01/top-11-things-i-learned-at-science.html
http://deepseanews.com/2010/01/miriam-joins-us-at-dsn/
http://www.guardian.co.uk/science/2010/jan/18/running-brain-memory-cell-growth
#scio10
#Biotechnology
#hcsm

January 19th
http://trueslant.com/ryansager/2010/01/18/science-reporting-gone-wild/
http://timesonline.typepad.com/science/2010/01/science-on-the-bbc.html
http://friendfeed.com/brembs/177a01db/bertrand-russell-on-god-1959
#scio10
#Biotechnology
#ten23

January 20th
http://www.shortyawards.com/
http://www.ustream.tv/channel/nada-importa
http://friendfeed.com/jcbradley/0a46ac22/science-online-2010-thoughts
#scio10
#health
#technology

January 21st
http://www.popsci.com/science/article/2010-01/five-reasons-henrietta-lacks-most-important-woman-medical-history
http://phylogenomics.blogspot.com/2010/01/enough-w-good-here-are-top10-problems-w.html
http://www.newscientist.com/article/dn18423-viruses-use-hive-intelligence-to-focus-their-attack.html
#scio10
#technology
#ten23

January 22nd
http://fc07.deviantart.net/fs19/f/2007/248/a/f/dna_strand_corset_32_piercings_by_mizuzinkaholik.jpg
http://friendfeed.com/danielmietchen/cbfc448b/collaborative-futures-3-mike-linksvayer
http://scienceblogs.com/bookoftrogool/2010/01/scientists_why_your_access_to.php
#scio10
#corporateeyesontheprize
#technology

January 23rd
http://www.badscience.net/2010/01/12-monkeys-no-8-wait-sorry-i-meant-14/
http://www.ustream.tv/channel/aw8
http://friendfeed.com/pansapiens/212fde9c/you-know-your-research-is-original-when
#scio10
#3wordsconservativeshate
#FF

January 24th
http://www.shortyawards.com/
http://featuresblogs.chicagotribune.com/printers-row/2010/01/eureka-great-discoveries-in-new-science-books.html
http://friendfeed.com/science-2-0/3124a7c3/looking-for-help-on-building-list-of-social-web
#3wordsconservativeshate
#retailpolitics
#scio10

January 25th
http://iambiotech.org/2010/01/25/biotech-roundup-monday-january-25th/?utm_source=hootsuite&utm_medium=tweet&utm_content=roundup&utm_campaign=hootsuite
http://friendfeed.com/mfenner/04c40a1a/scientists-and-librarians-friend-or-foe
http://blogs.telegraph.co.uk/technology/markchangizi/100004573/do-ant-colonies-have-something-in-common-with-the-human-body/
#scio10
#hcsm
#science


Since starting the Scientwists Project a bit over a week ago, I’ve been busy hacking up Bash and R scripts in order to analyze the data produced by the 500+ scholars that I’m following. Here’s a first glimpse of what they’ve been tweeting about, specifically the URLs and hashtags they’ve used.

In total, I’ve collected about 12,000 tweets since January 7th, containing 4,750 different URLs and 1,130 different hashtags.

10 most popular URLs

1. The Shorty Awards

2. Dennis Meadows: The Oil Drum: Economics and Limits to Growth: What’s Sustainable?

3. Björn Brembs: Social filtering of scientific information – a view beyond Twitter

4. BioData Product Blog: Laboratory Notebooks: A thing of the past?

5. Forbes.com: Illumina’s Cheap New Gene Machine

6. A photograph of clouds that seem to resemble Great Britain :-)

7. Times Online: Baroness Greenfield loses her job in Royal Institution shake-up

8. Mr. Gunn: Cell launches a new format for the presentation of research articles online

9. Daniel Mietchen: On the need for a global academic internet platform [ref to Nadja Kutz: arxiv.org/abs/0803.1360]

10. Rebecca Skloot: The Immortal Life of Henrietta Lacks

These were tweeted between 5 (#9 and #10) and 30 (#1) times. However, tracking URLs is complicated by the fact that many different addresses may point to the same source, especially since people use a variety of different URL shorteners. This is something I’ll resolve later, so for now this is fairly anecdotal.

15 most popular hashtags

1. #scio10 (391x)
2. #scidebate (84x)
3. #fb (75x)
4. #science (68x)
5. #technology (67x)
6. #tcot (58x)
7. #orca (54x)
8. #debateanatel (53x)
9. #Glee (31x)
10. #ff (27x)
11. #HeLa (26x)
12. #uksnow (26x)
13. #Haiti (25x)
14. #NetDE (24x)
15. #gov20 (21x)

Obviously some of these are automatically generated (#fb and #ff), but there’s a fair share of interesting ones. I’m expecting #scio10 will dominate the next few days even more visibly.
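
A count like the one above can be reproduced in a few lines of R. This is a minimal sketch on made-up tweets, not the Bash and R scripts actually used for the list; the regular expression and the decision to lowercase hashtags (so that #FF and #ff are counted together) are assumptions.

```r
# Illustrative tweets (made up); the real input would be the collected corpus
tweets <- c("Great session at #scio10 #science",
            "Follow these people #FF #scio10",
            "more #ff recommendations from #scio10")

# Extract every hashtag, lowercase it, and count occurrences
matches <- regmatches(tweets, gregexpr("#[A-Za-z0-9_]+", tweets))
hashtags <- tolower(unlist(matches))
counts <- sort(table(hashtags), decreasing = TRUE)
print(counts)
```

Without the tolower() step, #FF and #ff would show up as two separate entries.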

Hope it’s informative – let me know if you have any questions. :-)

Microblogging services such as Twitter and FriendFeed appear to be steadily gaining popularity among academics for work-related purposes (communication at conferences, discussion of publications, casual conversation). As part of a larger project on the evolution of scholarly communication I am today launching a study of academic uses of Twitter across disciplines.

One component of this study will be a corpus of tweets by international scholars from different fields over the course of one year. This corpus will be assembled via the account @scientwists, an automated user controlled via the Twitter API, and made available in the public domain after completion. The @scientwists account will follow a list of scholars put together from several sources, starting with this list assembled by David Bradley.*

The corpus will be anonymized, i.e. user names will not be legible. It will also be possible to exclude individual posts from the corpus via use of the hashtag #exclude. However, if you receive a notification that @scientwists is following you and you would prefer for your tweets not to be included in the corpus at all, please simply block @scientwists.
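
As a rough illustration of these two rules, the following sketch drops posts carrying #exclude and replaces user names with arbitrary IDs. The data frame, its column names and the `user1`, `user2` naming scheme are all assumptions; the actual anonymization procedure of the study is not specified here, and user names mentioned inside tweet texts would need separate handling.

```r
# Made-up mini corpus; column names are assumptions for illustration
corpus <- data.frame(
  from_user = c("alice", "bob", "alice"),
  text = c("New paper out", "Private remark #exclude", "See you at the conference"),
  stringsAsFactors = FALSE
)

# Drop posts that carry the #exclude hashtag
kept <- corpus[!grepl("#exclude\\b", corpus$text, ignore.case = TRUE, perl = TRUE), ]

# Replace each user name with an ID based on order of first appearance
kept$from_user <- paste0("user", match(kept$from_user, unique(kept$from_user)))
print(kept)
```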

If you have questions or suggestions, please be sure to contact me on Twitter or via email.

- Cornelius Puschmann, Heinrich-Heine-Universität Düsseldorf (about me)

Note: if you are not an academic and are being followed by @scientwists2 you have been randomly included in the control group for this study. Please block @scientwists2 if you prefer your tweets not to be used.


danah boyd on Twitter

On August 16, 2009, in Thoughts, by cornelius

Just read a spot-on blog post by danah boyd on how Twitter communication is frequently misinterpreted by laymen:

Far too many tech junkies and marketers are obsessed with Twitter becoming the next news outlet source. As a result, the press are doing what they did with blogging: hyping Twitter up as this amazing source of current events and dismissing it as pointless babble. Haven’t we been there, done that? Scott Rosenberg even wrote the book on it!

Yes, absolutely. We’ve been there and this is really just a rehash of the “relevance debate” we already had with blogging and that will probably stay with us for a long time. Communicating publicly used to be a privilege enjoyed only by a select few and bound to very clear codes and conventions. Now that the barriers have been removed, we are faced with the shocking revelation that other people do not, in fact, communicate primarily with us in mind. Duh.

I do, however, disagree with danah regarding one minor point. People who seriously assign the category “pointless babble” to certain Twitter messages (based on what criterion, exactly?) are not researchers, they are “researchers” and they don’t produce studies, they produce “studies”. That’s why, in spite of all well-deserved skepticism, I think academia – ivory tower, arcane rituals and all – is a good thing. Because, for the most part, we try to figure out what’s really going on using actual data vs. simply telling people what they want to hear and then publishing the results in a glossy “report”.
