After recently teaching an introductory R class for linguists at the University of Bayreuth, I’ve decided to put my extended notes on the website in the form of a very basic tutorial. Check out Corpus and Text Linguistics with R (CTL-R) if you want to learn R fundamentals and have no prior programming experience. It’s still incomplete, but I hope to have more chapters ready soon. Happy R-ing! :-)

Those of you following my occasional updates here know that I have previously posted code for graphing Twitter friend/follower networks using R (post #1, post #2). Kai Heinrich was kind enough to send me some updated code for doing so with a newer version of the extremely useful twitteR package. His very crisp, yet thoroughly documented script is pasted below.

# Script for graphing Twitter friends/followers
# by Kai Heinrich (kai.heinrich@mailbox.tu-dresden.de) 
 
# load the required packages
 
library("twitteR")
library("igraph")
 
# HINT: In order for the tkplot() function to work on Mac OS X you need to install 
#       the TCL/TK build for X11 
#       (get it here: http://cran.us.r-project.org/bin/macosx/tools/)
#
# Get user information with the twitteR function getUser(); 
#  instead of your own username you can do this with any other username as well 
 
start<-getUser("YOUR_USERNAME") 
 
# Get friend and follower names by first fetching the IDs (getFriendIDs(), getFollowerIDs()) 
# and then looking up the names (lookupUsers()) 
 
friends.object<-lookupUsers(start$getFriendIDs())
followers.object<-lookupUsers(start$getFollowerIDs())
 
# Retrieve the names of your friends and followers from the friend
# and follower objects. You can limit the number of friends and followers by adjusting the 
# size of the selected data with [1:n], where n is the number of followers/friends 
# that you want to visualize. If you do not put in the expression the maximum number of 
# friends and/or followers will be visualized.
 
n<-20 
friends <- sapply(friends.object[1:n],name)
followers <- sapply(followers.object[1:n],name)
 
# Create a data frame that relates friends and followers to you for expression in the graph
relations <- merge(data.frame(User='YOUR_NAME', Follower=friends), 
data.frame(User=followers, Follower='YOUR_NAME'), all=T)
 
# Create graph from relations.
g <- graph.data.frame(relations, directed = T)
 
# Assign labels to the graph (=people's names)
V(g)$label <- V(g)$name
 
# Plot the graph using plot() or tkplot(). Remember the HINT at the 
# beginning if you are using Mac OS X
tkplot(g)
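
If tkplot() gives you trouble (no X11/Tcl-Tk available, for instance), a static plot with one of igraph’s layout functions is an alternative. The following is just a minimal sketch using the graph object g from the script above; the layout and size settings are arbitrary choices, not part of Kai’s code.

# Static alternative to tkplot(): draw the graph with a force-directed layout
plot(g, layout=layout.fruchterman.reingold, vertex.size=5, edge.arrow.size=0.3)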

I’ve been following the development of googleVis, the R implementation of the Google Visualization API, for a while now. The library has a lot of potential as a bridge between R (where data processing happens) and HTML (where presentation is [increasingly] happening). A growing number of visualization frameworks are on the market and all have their perks (e.g. Many Eyes, Simile, Flare). I guess I was so inspired by the Hans Rosling TED talk, which makes such great use of bubble charts, that I wanted to try the Google Visualization API for that chart type alone. There’s more, however, if you don’t care much for floating bubbles: neat chart variants include the geochart, area charts and the usual classics (bar, pie, etc.). Check out the chart gallery for an overview.

So here are my internet growth charts:

(1) Motion chart showing the growth of the global internet population since 2000 for 208 countries

(2) World map showing global internet user statistics for 2009 for 208 countries

Data source: data.un.org (ITU database). I’ve merged two tables from the database into one (absolute numbers and percentages) and cleaned the data up a bit. The resulting tab-separated CSV file is available here.
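
If you want to reproduce the merging step yourself, a minimal sketch is shown below; the input file names and column labels of the two ITU tables are placeholders (only Country and Year are assumed, since those are the columns used in the chart code further down).

# Combine absolute numbers and percentages into one table (file names are placeholders)
users.abs <- read.csv("itu_absolute.csv", sep="\t")
users.pct <- read.csv("itu_percent.csv", sep="\t")
n <- merge(users.abs, users.pct, by=c("Country", "Year"))
write.table(n, "netstats.csv", sep="\t", row.names=FALSE)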

And here’s the R code for rendering the chart. Basically you just replace gvisMotionChart() with gvisGeoChart() for the second chart; the rest is the same.

library("googleVis")
n <- read.csv("netstats.csv", sep="\t")
nmotion <- gvisMotionChart(n, idvar="Country", timevar="Year", options=list(width=1024, height=768))
plot(nmotion)
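
For completeness, a rough sketch of the corresponding geochart call for chart (2) is shown below; the column name "InternetUsers" is a guess (use whatever the user-count column in netstats.csv is actually called), and subsetting to 2009 matches the world map above.

n2009 <- subset(n, Year == 2009)
ngeo <- gvisGeoChart(n2009, locationvar="Country", colorvar="InternetUsers", options=list(width=1024, height=768))
plot(ngeo)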

Extracting comments from a Blogger.com blog post with R

On February 20, 2011, in Code, by cornelius

Note #1: Check out this very useful post by Najko Jahn describing how to extract links to blogs via Google Blog Search.

Note #2: I’ll update the code below once I find the time using Najko’s cleaner XPath-based solution.

Recently I’ve been working with comments as part of the project on science blogging we’re doing at the Junior Researchers Group “Science and the Internet”. I wrote the script below to quickly extract comments from Atom feeds, such as those generated by Blogger.com.

The code isn’t exactly pretty, mostly because I didn’t use an XML parser to properly read the data, instead resorting to brute-force pattern matching, but it gets the job done. Two easier (and cleaner) routes would have been to a) get the data directly from the Google Data API (doesn’t work as far as I can tell, since there seems to be no implementation for R*) or b) parse the data specifically as Atom (doesn’t work as — annoyingly — there is no dedicated Atom parsing support in R). Properly parsing the XML, while not rocket science, seemed more complex than necessary to me, especially given that Atom is common enough that dedicated parsing support really ought to exist.
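
For reference, a cleaner XPath-based version along the lines of what Najko describes might look something like this sketch (it uses the XML package and only fetches the first page of the feed, whereas the script below pages through the archive):

library("XML")
library("RCurl")
feedurl <- "http://rrresearch.blogspot.com/feeds/2171542729230739732/comments/default"
doc <- xmlParse(getURL(feedurl))
ns <- c(a="http://www.w3.org/2005/Atom")   # Atom elements live in this namespace
users <- xpathSApply(doc, "//a:entry/a:author/a:name", xmlValue, namespaces=ns)
dates <- xpathSApply(doc, "//a:entry/a:published", xmlValue, namespaces=ns)
comments <- xpathSApply(doc, "//a:entry/a:content", xmlValue, namespaces=ns)
d <- data.frame(date=dates, user=users, comment=comments)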

Scraping, by the way, makes for a very nice exercise for a pragmatic programming class (the one you might teach in the Digital Humanities or Information Science), since you teach people how to get their hands on data they can then use as part of their own projects.

rm(list=ls(all=T));
library("RCurl");
 
rounds <- 3;
perpage <- 100;
feedurl <- "http://rrresearch.blogspot.com/feeds/2171542729230739732/comments/default";
 
for (i in 1:rounds) {
	thisurl <- paste(feedurl, "?start-index=", ((i - 1) * perpage + 1), "&max-results=", perpage, sep="");
	if (exists("feeddata")==T) feeddata <- c(feeddata, getURL(thisurl)) else feeddata <- getURL(thisurl);
}
 
buffer <- paste(feeddata, collapse=" ");
 
entries <- unlist(strsplit(buffer, "<entry>"));
entries <- gsub("</feed>.*?$", "", entries);
entries <- entries[-1];
 
# get rid of quotes, excess whitespace etc
entries <- gsub("\n", "", entries, perl=T);
entries <- gsub("&amp;#39;", "\'", entries, perl=T);
entries <- gsub("&amp;quot;", "\"", entries, perl=T);
entries <- gsub("(&lt;br /&gt;)+", " ", entries, perl=T);
entries <- gsub("&lt;", "<", entries, perl=T);
entries <- gsub("&gt;", ">", entries, perl=T);
 
# extract date, author and text of comments
dates <- gsub("^<id>.*?<published>([0-9T:\\.-]{29,})</published>.*?</entry>(</feed>)?$", "\\1", entries, perl=T);
dates <- paste(substr(dates, 1, 10), substr(dates, 12, 19));
dates.px <- as.POSIXct(dates, origin="1970-01-01", tz="GMT-1");
dates.f <- strftime(dates.px, "%d %b %H:%M");
users <- gsub("^<id>.*?<name>(.*?)</name>.*?</entry>(</feed>)?$", "\\1", entries, perl=T);
comments <- gsub("^<id>.*?<content type='html'>(.*?)</content>.*?</entry>(</feed>)?$", "\\1", entries, perl=T);
posters <- sort(table(users), decreasing=T);
 
d <- data.frame(date=dates.f, user=users, comment=comments);
 
# write two tables, one containing all the comments and the other a simple frequency list
write.csv(d, file="blog-comments.csv");
write.csv(posters, "blog-posters.csv");

* I spoke a bit too soon there. There is an implementation for Google Data with R, but it doesn’t support Blogger.com and many other interesting services. Hopefully such an implementation will be provided eventually. That, or I just quit whining and learn Python…

Dynamic Twitter graphs with R and Gephi (clip and code)

On January 2, 2011, in Code, by cornelius

Note 1/4/11: I’ve updated the code below after discovering a few bugs.

Back in October when Jean Burgess first posted a teaser of the Gephi dynamic graph feature applied to Twitter data, I thought right away that this was going to bring Twitter visualization to an entirely new level. When you play around with graph visualizations for a while, you inevitably come to the conclusion that they are of very limited use for studying something like Twitter because of its dynamic nature as an ongoing communicative process. Knowing that someone retweeted someone else a lot or that a certain word occurred many times is only half the story. When someone got a lot of retweets or when a certain word was used frequently is often much more interesting.

Anyhow, Axel Bruns posted a first bit of code (for generating GEXF files) back in October, followed by a detailed implementation (1, 2) a few days ago. Since Axel uses Gawk and I prefer R, the first thing I did was to write an R port of Axel’s code. It does the following:

  1. Extract all tweets containing @-messages and retweets from a Twapperkeeper hashtag archive.
  2. Generate a table containing the fields sender, recipient, start time and end time for each data point.
  3. Write this table to a GEXF file.

The implementation as such wasn’t difficult and I didn’t really follow Axel’s code too closely, since R is syntactically different from Gawk. The thing I needed to figure out was the logic of the GEXF file, specifically start and end times, in order to make sure that edges decay over time. Axel explains this in detail in his post and provides a very thorough and clean implementation.
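
To make the start and end logic concrete, this is the shape of a single dynamic edge as my script writes it (the usernames and timestamps are made up): with decaytime = 3600, each slice ends one hour after the mention that created it, and the edge as a whole spans from its first mention to the decay of its last one.

<edge id="1" source="alice" target="bob" start="0" end="8600">
<attvalues>
	<attvalue for="0" value="1" start="0" />
	<attvalue for="0" value="1" start="5000" />
</attvalues>
<slices>
	<slice start="0" end="3600" />
	<slice start="5000" end="8600" />
</slices>
</edge>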

My own implementation is rougher and probably still needs polishing in several places, but here’s a first result (no sound; watch in HD and fullscreen):

Note 1/7/11: I’ve replaced the clip above with a better one after ironing out a few issues with my script. The older clip is still available here.

Like previous visualizations I’ve done, this also uses the #MLA09 data, i.e. tweets from the 2009 convention of the Modern Language Association.
1/7/11: The newer clip is based on data from Digital Humanities 2010 (#dh2010).

And here’s the R code for generating the GEXF file, in case you want to play around with it:

rm(list=ls(all=T));
 
outfile.gexf <- "dh2010.gexf";
decaytime = 3600;
buffer = 0;
eid = 1;
 
tweets <- read.csv(file.choose(), head=T, sep="|", quote="", fileEncoding="UTF-8");
# match @-mentions case-insensitively, since usernames may contain capital letters
ats <- tweets[grep("@([a-z0-9_]{1,15}):?", tweets$text, ignore.case=T),];
g.from <- tolower(as.character(ats$from_user));
g.to <- tolower(gsub("^.*@([a-z0-9_]{1,15}):?.*$", "\\1", ats$text, ignore.case=T, perl=T));
g.start <- ats$time - min(ats$time) + buffer;
g.end <- ats$time - min(ats$time) + decaytime + buffer;
g <- data.frame(from=g.from[], to=g.to[], start=g.start[], end=g.end[]);
g <- g[order(g$from, g$to, g$start),];
output <- paste("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<gexf xmlns=\"http://www.gexf.net/1.2draft\" version=\"1.2\">\n<graph mode=\"dynamic\" defaultedgetype=\"directed\" start=\"0\" end=\"", max(g$end) + decaytime, "\">\n<edges>\n", sep ="");
all.from <- as.character(unique(g$from));
for (i in 1:length(all.from))
{
	this.from <- all.from[i];
	# exact matching (==) avoids lumping together usernames that are substrings of one another
	this.to <- as.character(unique(g$to[g$from == this.from]));
	for (j in 1:length(this.to))
	{
		all.starts <- g$start[g$from == this.from & g$to == this.to[j]];
		all.ends <- g$end[g$from == this.from & g$to == this.to[j]];
		output <- paste(output, "<edge id=\"", eid, "\" source=\"", this.from, "\" target=\"", this.to[j], "\" start=\"", min(all.starts), "\" end=\"", max(all.ends), "\">\n<attvalues>\n", sep="");
		for (k in 1:length(all.starts))
		{	
			# overlap
			# if (all.starts[k+1] < all.ends[k]) output <- paste(output, "", sep=""); ... ?
			output <- paste(output, "\t<attvalue for=\"0\" value=\"1\" start=\"", all.starts[k], "\" />\n", sep="");
		}
		output <- paste(output, "</attvalues>\n<slices>\n", sep="");		
		for (l in 1:length(all.starts))
		{
			output <- paste(output, "\t<slice start=\"", all.starts[l], "\" end=\"", all.ends[l], "\" />\n", sep="");
		}
		output <- paste(output, "</slices>\n</edge>\n", sep="");	
		eid = eid + 1;
	}
} 
output <- paste(output, "</edges>\n</graph>\n</gexf>\n", sep = "");
cat(output, file=outfile.gexf);

After recently discovering the excellent methods section on mappingonlinepublics.net, I decided it was time to document my own approach to Twitter data. I’ve been messing around with R and igraph for a while, but it wasn’t until I discovered Gephi that things really moved forward. R/igraph are great for preprocessing the data (not sure how they compare with Awk), but rather cumbersome to work with when it comes to visualization. Last week, I posted a first Gephi visualization of retweeting at the Free Culture Research Conference and since then I’ve experimented some more (see here and here). #FCRC was a test case for a larger study that examines how academics use Twitter at conferences, which is part of what we’re doing at the junior researchers group Science and the Internet at the University of Düsseldorf (sorry, website is currently in German only).

Here’s a step-by-step description of how those graphs were created.

Step #1: Get tweets from Twapperkeeper
Like Axel, I use Twapperkeeper to retrieve tweets tagged with the hashtag I’m investigating. This has several advantages:

  • it’s possible to retrieve older tweets which you won’t get via the API
  • tweets are stored as CSV rather than XML which makes them easier to work with for our purposes.

The sole disadvantage of Twapperkeeper is that we have to rely on the integrity of their archive — if for some reason not all tweets with our hashtag have been retrieved, we won’t know. Also, certain information that is present in Twitter’s XML (e.g. geolocation) and that we might be interested in is not retained in Twapperkeeper’s CSV files.

Instructions:

  1. Search for the hashtag you’re interested in (e.g. #FCRC). If no archive exists, create one.
  2. Go to the archive’s Twapperkeeper page, sign into Twitter (button at the top) and then choose export and download at the bottom of the page.
  3. Choose the pipe character (“|”) as separator. I use that one rather than the more conventional comma or semicolon because we are dealing with text data, which is bound to contain those characters a lot. Of course the pipe can also be parsed incorrectly, so be sure to have a look at the graph file you make.
  4. Voila. You should now have a CSV file containing tweets on your hard drive. Edit: Actually, you have a .tar file that contains the tweets. Look inside the .tar for a file with a very long name ending with “-1” (not “info”) — that’s the data we’re looking for (see the sketch after this list).
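
A small helper sketch for step 4, in case you prefer to unpack the archive from within R (the archive file name is a placeholder):

untar("your-twapperkeeper-export.tar", exdir="tweets_raw")
list.files("tweets_raw")   # pick the file with the very long name ending in "-1"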

Step #2: Turn CSV data into a graph file with R and igraph
R is an open source statistics package that is primarily used via the command line. It’s absolutely fantastic at slicing and dicing data, although the syntax is a bit quirky and the documentation is somewhat geared towards experts (=statisticians). igraph is an R package for constructing and visualizing graphs. It’s great for a variety of purposes, but due to the command line approach of R, actually drawing graphs with igraph was somewhat difficult for me. But, as outlined below, Gephi took care of that. Running the code below in R will transform the CSV data into a GraphML file which can then be visualized with Gephi. While R and igraph rock at translating the data into another format, Gephi is the better tool for the actual visualization.

Instructions:

  1. Download and install R.
  2. In the R console, run the following: install.packages("igraph");
  3. Copy the CSV you’ve just downloaded from Twapperkeeper to an empty directory and rename it to tweets.csv.
  4. Finally, save the R file below to the same folder as the CSV and run it (a minimal run sketch follows this list).
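
A minimal run sketch for these steps (“path/to/your/folder” is a placeholder):

install.packages("igraph");
setwd("path/to/your/folder");   # the folder containing tweets.csv and tweetgraph.R
source("tweetgraph.R");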

Code for extracting RTs and @s from a Twapperkeeper CSV file and saving the result in the GraphML format:

Download tweetgraph.R
# Extract @-message and RT graphs from conference tweets
library(igraph);
 
# Read Twapperkeeper CSV file
tweets <- read.csv("tweets.csv", head=T, sep="|", quote="", fileEncoding="UTF-8");
print(paste("Read ", length(tweets$text), " tweets.", sep=""));
 
# Get @-messages, senders, receivers
ats <- grep("^\\.?@[a-z0-9_]{1,15}", tolower(tweets$text), perl=T, value=T);
at.sender <- tolower(as.character(tweets$from_user[grep("^\\.?@[a-z0-9_]{1,15}", tolower(tweets$text), perl=T)]));
at.receiver <- gsub("^\\.?@([a-z0-9_]{1,15})[^a-z0-9_]+.*$", "\\1", ats, perl=T);
print(paste(length(ats), " @-messages from ", length(unique(at.sender)), " senders and ", length(unique(at.receiver)), " receivers.", sep=""));
 
# Get RTs, senders, receivers
rts <- grep("^rt @[a-z0-9_]{1,15}", tolower(tweets$text), perl=T, value=T);
rt.sender <- tolower(as.character(tweets$from_user[grep("^rt @[a-z0-9_]{1,15}", tolower(tweets$text), perl=T)]));
rt.receiver <- gsub("^rt @([a-z0-9_]{1,15})[^a-z0-9_]+.*$", "\\1", rts, perl=T);
print(paste(length(rts), " RTs from ", length(unique(rt.sender)), " senders and ", length(unique(rt.receiver)), " receivers.", sep=""));
 
# This is necessary to avoid problems with empty entries, usually caused by encoding issues in the source files
at.sender[at.sender==""] <- "<NA>";
at.receiver[at.receiver==""] <- "<NA>";
rt.sender[rt.sender==""] <- "<NA>";
rt.receiver[rt.receiver==""] <- "<NA>";
 
# Create a data frame from the sender-receiver information
ats.df <- data.frame(at.sender, at.receiver);
rts.df <- data.frame(rt.sender, rt.receiver);
 
# Transform data frame into a graph
ats.g <- graph.data.frame(ats.df, directed=T);
rts.g <- graph.data.frame(rts.df, directed=T);
 
# Write sender -> receiver information to a GraphML file
print("Write sender -> receiver table to GraphML file...");
write.graph(ats.g, file="ats.graphml", format="graphml");
write.graph(rts.g, file="rts.graphml", format="graphml");
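
As a quick sanity check before moving on to Gephi, you can already rank accounts in R; a small follow-up sketch using the graphs created above:

# Ten most retweeted accounts, measured by in-degree in the RT graph
head(sort(degree(rts.g, mode="in"), decreasing=TRUE), 10);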

Step #3: Visualize graph with Gephi
Once you’ve completed steps 1 and 2, simply open your GraphML file(s) with Gephi. You should see a visualization of the graph. I won’t give an in-depth description of how Gephi works, but the users section of gephi.org has great tutorials which explain both Gephi and graph visualization in general really well.

I’ll post more on the topic as I make further progress, for example with stuff like dynamic graphs which show change in the network over time.
