After recently discovering the excellent methods section on mappingonlinepublics.net, I decided it was time to document my own approach to Twitter data. I’ve been messing around with R and igraph for a while, but it wasn’t until I discovered Gephi that things really moved forward. R/igraph are great for preprocessing the data (not sure how they compare with Awk), but rather cumbersome to work with when it comes to visualization. Last week, I posted a first Gephi visualization of retweeting at the Free Culture Research Conference and since then I’ve experimented some more (see here and here). #FCRC was a test case for a larger study that examines how academics use Twitter at conferences, which is part of what we’re doing at the junior researchers group Science and the Internet at the University of Düsseldorf (sorry, website is currently in German only).

Here’s a step-by-step description of how those graphs were created.

Step #1: Get tweets from Twapperkeeper
Like Axel, I use Twapperkeeper to retrieve tweets tagged with the hashtag I’m investigating. This has several advantages:

  • it’s possible to retrieve older tweets which you won’t get via the API
  • tweets are stored as CSV rather than XML which makes them easier to work with for our purposes.

The main disadvantage of Twapperkeeper is that we have to rely on the integrity of their archive: if for some reason not all tweets with our hashtag have been retrieved, we won’t know. Also, Twapperkeeper’s CSV files omit certain information that is present in Twitter’s XML and that we might be interested in (e.g. geolocation).

Instructions:

  1. Search for the hashtag you’re interested in (e.g. #FCRC). If no archive exists, create one.
  2. Go to the archive’s Twapperkeeper page, sign into Twitter (button at the top) and then choose export and download at the bottom of the page.
  3. Choose the pipe character (“|”) as separator. I use that one rather than the more conventional comma or semicolon because we are dealing with text data which is bound to contain these characters a lot. Of course a pipe inside a tweet will also be parsed incorrectly, so be sure to have a look at the graph file you make.
  4. Voila. You should now have a CSV file containing tweets on your hard drive. Edit: Actually, you have a .tar file that contains the tweets. Look inside the .tar for a file with a very long name ending with “-1” (not “info”); that’s the data we’re looking for.
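
One way to look for stray pipes before building the graph is to count the fields on every line of the export. The snippet below is only a sketch: the sample lines and the id column are made up for illustration (the script in step 2 only relies on the text and from_user columns), and a temporary file stands in for the real download.

```r
# Made-up sample standing in for a Twapperkeeper export (the "id" column is hypothetical)
sample.lines <- c(
  "text|from_user|id",
  "@alice nice talk|bob|1",
  "RT @carol: slides | notes are online|dave|2",  # stray pipe inside the tweet
  "great session #fcrc|erin|3"
)
csv.path <- tempfile(fileext=".csv")
writeLines(sample.lines, csv.path)

# Count the pipe-separated fields on every line; quoting and comment
# characters are switched off so that quotes and hashtags are treated as plain text
fields <- count.fields(csv.path, sep="|", quote="", comment.char="")

# Lines whose field count differs from the header line deserve a closer look
bad.rows <- which(fields != fields[1])
print(bad.rows)
```

With a real export, replace csv.path with the path to your own file. Lines reported here are the ones most likely to show up as odd nodes in the graph later on.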

Step #2: Turn CSV data into a graph file with R and igraph
R is an open source statistics package that is primarily used via the command line. It’s absolutely fantastic at slicing and dicing data, although the syntax is a bit quirky and the documentation is somewhat geared towards experts (=statisticians). igraph is an R package for constructing and visualizing graphs. It’s great for a variety of purposes, but due to the command line approach of R, actually drawing graphs with igraph was somewhat difficult for me. But, as outlined below, Gephi took care of that. Running the code below in R will transform the CSV data into a GraphML file which can then be visualized with Gephi. While R and igraph rock at translating the data into another format, Gephi is the better tool for the actual visualization.

Instructions:

  1. Download and install R.
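  2. In the R console, run the following: install.packages("igraph") (the quotes are required; without them R looks for an object named igraph and reports that it cannot be found).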
  3. Copy the CSV you’ve just downloaded from Twapperkeeper to an empty directory and rename it to tweets.csv.
  4. Finally, save the R file below to the same folder as the CSV and run it.
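
For readers who have not run an R script before, the steps above could be wrapped up roughly as follows. This is a sketch, not part of the original script: missing.files and run.tweetgraph are hypothetical helper names, and the script is assumed to be saved as tweetgraph.R next to tweets.csv.

```r
# Hypothetical helper: which of the required files are missing from a folder?
missing.files <- function(dir, needed=c("tweets.csv", "tweetgraph.R")) {
  needed[!file.exists(file.path(dir, needed))]
}

# Hypothetical wrapper: install igraph if necessary, then run the script in its folder
run.tweetgraph <- function(dir) {
  absent <- missing.files(dir)
  if (length(absent) > 0) {
    stop("Missing in ", dir, ": ", paste(absent, collapse=", "))
  }
  if (!requireNamespace("igraph", quietly=TRUE)) {
    install.packages("igraph")  # the package name must be quoted
  }
  old.dir <- setwd(dir)    # relative paths such as "tweets.csv" resolve against this folder
  on.exit(setwd(old.dir))  # restore the previous working directory afterwards
  source("tweetgraph.R")
}
```

Calling run.tweetgraph with the path of the folder you created in step 3 should then either run the script or say which file is missing. Checking the folder first avoids the "cannot open file" error that R reports when tweets.csv is not in the working directory.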

Code for extracting RTs and @s from a Twapperkeeper CSV file and saving the result in the GraphML format (tweetgraph.R):

# Extract @-message and RT graphs from conference tweets
library(igraph)

# Read Twapperkeeper CSV file (pipe-separated, quoting disabled)
tweets <- read.csv("tweets.csv", header=TRUE, sep="|", quote="", fileEncoding="UTF-8")
print(paste("Read ", length(tweets$text), " tweets.", sep=""))

# Lower-case tweet text and sender names once, so that matching is case-insensitive
text <- tolower(as.character(tweets$text))
from <- tolower(as.character(tweets$from_user))

# Get @-messages, senders, receivers
# The pattern matches tweets that start with "@user" or ".@user" and captures the user name.
# Whatever follows the name is optional, so a tweet consisting only of "@user" is reduced
# to the name as well instead of ending up as a node in its entirety.
at.pattern <- "(?s)^\\.?@([a-z0-9_]{1,15}).*$"
at.idx <- grep(at.pattern, text, perl=TRUE)
ats <- text[at.idx]
at.sender <- from[at.idx]
at.receiver <- sub(at.pattern, "\\1", ats, perl=TRUE)
print(paste(length(ats), " @-messages from ", length(unique(at.sender)), " senders and ", length(unique(at.receiver)), " receivers.", sep=""))

# Get RTs, senders, receivers (tweets that start with "RT @user")
rt.pattern <- "(?s)^rt @([a-z0-9_]{1,15}).*$"
rt.idx <- grep(rt.pattern, text, perl=TRUE)
rts <- text[rt.idx]
rt.sender <- from[rt.idx]
rt.receiver <- sub(rt.pattern, "\\1", rts, perl=TRUE)
print(paste(length(rts), " RTs from ", length(unique(rt.sender)), " senders and ", length(unique(rt.receiver)), " receivers.", sep=""))

# This is necessary to avoid problems with empty entries, usually caused by encoding issues in the source files
at.sender[at.sender==""] <- "<NA>"
at.receiver[at.receiver==""] <- "<NA>"
rt.sender[rt.sender==""] <- "<NA>"
rt.receiver[rt.receiver==""] <- "<NA>"

# Create a data frame from the sender-receiver information
ats.df <- data.frame(at.sender, at.receiver)
rts.df <- data.frame(rt.sender, rt.receiver)

# Transform data frame into a directed graph
# (graph_from_data_frame and write_graph are called graph.data.frame and write.graph in older igraph versions)
ats.g <- graph_from_data_frame(ats.df, directed=TRUE)
rts.g <- graph_from_data_frame(rts.df, directed=TRUE)

# Write sender -> receiver information to a GraphML file
print("Write sender -> receiver table to GraphML file...")
write_graph(ats.g, file="ats.graphml", format="graphml")
write_graph(rts.g, file="rts.graphml", format="graphml")
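
To see what the pattern matching produces, here is a small worked example in base R on a handful of made-up tweets. It is a sketch under assumptions: the tweets and user names are invented, extract.edges is a hypothetical helper, and the patterns capture the user name with a group, which may differ in detail from the expressions in the script.

```r
# Made-up tweets; only the two columns used by the script are included
tweets <- data.frame(
  from_user=c("Bob", "dave", "erin", "Frank"),
  text=c("@Alice nice talk!", "RT @carol: slides are online", ".@bob agreed", "no mention here"),
  stringsAsFactors=FALSE
)
text <- tolower(tweets$text)
from <- tolower(tweets$from_user)

# Hypothetical helper: keep tweets matching the pattern and return one
# sender -> receiver row per tweet, with the receiver taken from the capture group
extract.edges <- function(pattern) {
  idx <- grep(pattern, text, perl=TRUE)
  data.frame(sender=from[idx],
             receiver=sub(pattern, "\\1", text[idx], perl=TRUE),
             stringsAsFactors=FALSE)
}

ats.df <- extract.edges("(?s)^\\.?@([a-z0-9_]{1,15}).*$")  # "@user ..." and ".@user ..."
rts.df <- extract.edges("(?s)^rt @([a-z0-9_]{1,15}).*$")   # "RT @user ..."
print(ats.df)
print(rts.df)
```

Here the @-message table contains the edges bob -> alice and erin -> bob, and the RT table contains dave -> carol. The fourth tweet mentions nobody at the start and is dropped from both.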

Step #3: Visualize graph with Gephi
Once you’ve completed steps 1 and 2, simply open your GraphML file(s) with Gephi. You should see a visualization of the graph. I won’t give an in-depth description of how Gephi works, but the users section of gephi.org has great tutorials which explain both Gephi and graph visualization in general really well.
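
If Gephi shows an empty or suspiciously small graph, it can help to check the GraphML file itself first. The sketch below simply counts node and edge tags as text, which is a crude check rather than a real XML parse. The file content is hand-written for illustration, count.tags is a hypothetical helper, and the approach assumes one tag per element.

```r
# Minimal hand-written GraphML file standing in for rts.graphml (made-up content)
graphml.lines <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<graphml xmlns="http://graphml.graphdrawing.org/xmlns">',
  '  <graph id="G" edgedefault="directed">',
  '    <node id="n0"/>',
  '    <node id="n1"/>',
  '    <edge source="n0" target="n1"/>',
  '  </graph>',
  '</graphml>'
)
path <- tempfile(fileext=".graphml")
writeLines(graphml.lines, path)

# Hypothetical helper: count how often an opening tag such as <node ...> occurs in a file
count.tags <- function(path, tag) {
  xml <- paste(readLines(path, warn=FALSE), collapse="\n")
  hits <- gregexpr(paste0("<", tag, "[ />]"), xml)[[1]]
  sum(hits > 0)  # gregexpr returns -1 when there is no match at all
}

n.nodes <- count.tags(path, "node")
n.edges <- count.tags(path, "edge")
print(c(nodes=n.nodes, edges=n.edges))
```

For a real file, point path at ats.graphml or rts.graphml. Zero edges usually means that no tweets matched in step 2, for example because the CSV was read with the wrong separator.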

I’ll post more on the topic as I make further progress, for example with stuff like dynamic graphs which show change in the network over time.

22 Responses to Generating graphs of retweets and @-messages on Twitter using R and Gephi

  1. Tal Galili says:

    Great post!
    Any chance for you to add your R tag:
    http://blog.ynada.com/tag/r
    To R-bloggers?
    http://www.r-bloggers.com/add-your-blog/

    (p.s: add “subscribe to comments” plugin :)
    And maybe also this:
    http://www.r-statistics.com/2010/10/wp-codebox-a-better-r-syntax-highlighter-plugin-for-wordpress/
    )

  2. Dave says:

    Hi,

    I copied the code, installed R on my Mac and saved a Twapperkeeper archive in .csv with pipe as the delimiter. I made the code above into an .R file and “executed” it in the R gui. The code seems to run, but nothing is output. I’m not a programmer, so I’m probably missing out a crucial step! Are you able to give a brief description of how to get your code into R and then run it?

  3. cornelius says:

    @Tal: thanks a bunch for your suggestions. I’ve just added myself to r-bloggers.com and will look into the other things as well.

    @Dave: my bad, there are a few (small) things I forgot — I’ve just corrected the code and sent you an email.

  4. Dave says:

    Hi again,

    This seems to work fine. Maybe just a couple of bugs?

    1) Sometimes whole tweets are extracted as nodes
    2) Can the code cope with a CSV of ~11MB in size? It runs, but doesn’t extract any nodes/edges.

    Thanks!

    Dave.

  5. cornelius says:

    Hi Dave,

    1) Hmm this is probably related to issues with the CSV file. Have you checked if the pipe character occurs anywhere where it shouldn’t occur (i.e. in the offending tweets)?
    2) Size shouldn’t be a problem as such. Rendering a huge graph with Gephi may cause issues though. Perhaps you want to try working with something smaller, at least initially.
    Sorry if that’s not very helpful, but hard for me to tell what is wrong from here. :-)

    Cornelius

  6. Dave says:

    Thanks for your responses :)

    I’ll do some more tests on smaller files and see what happens.

    Thanks for your help.

    Dave.

  7. Mitch says:

    Cornelius,

    Thanks for this tutorial! My experience with R is pretty limited so I have run into a problem, and it’s probably poor execution on my part.

    After I load igraph and run your script I get the following error:

    ERROR: object ‘rts.g’ not found.

    This occurs when it attempts to call the write.graph function.

    Have you run into this issue before, or is it obvious what I might have done wrong? Any insight would be extremely helpful.

    Thanks again for providing this tutorial!

    Mitch

  8. Pral says:

    Hi,
    I have made a program and saved the file as testgraph.R

    library(igraph)
    G1<-(read.graph("testgraph.txt",format="edgelist"))
    plot(G1,layout=layout.fruchterman.reingold)

    What command should I give in the R-terminal to run this program?

  9. Pral says:

    Oh, I just have to use source("testgraph.R"). Sorry for bothering.

  10. [...] especially in relation to communication on Twitter. I’ve been experimenting both with graphs and with traditional bar and pie charts to show what happens when people use [...]

  11. Inder says:

    Hi
    When using igraph with R for this, I noticed that if the edge list is greater than 2^29 (~50 million) then igraph fails (unable to hash values)… any way to work around it?

  12. cornelius says:

    @Inder: not having worked with a dataset with that number of edges (or anything close to it) I can’t offer a solution. :-(

  13. [...] To produce this, I relied on various blog posts: Cornelius Puschmann: generating retweet graphs, a question on stackoverflow, R-chart, analyzing Twitter data with R… [...]

  14. CPWilson says:

    Now that TwapperKeeper have discontinued the Export&Download feature, do you know of any workarounds or alternatives to TwapperKeeper that will create a hashtag archive suitable for graphing with Gephi?

  15. Richard says:

    Hi,

    This looked so easy to follow, so off I went…

    I’ve downloaded R and Gephi. I’ve got the file of tweets from Twapperkeeper and I’ve downloaded the code from this page. When I try to run the ‘install.packages(igraph);’ it returns with: ‘Error in install.packages(igraph) : object ‘igraph’ not found’.

    I’m not a programmer of any sort, so perhaps I’m just doing something daft.

    Any help would be much appreciated!

  16. Eve says:

    Hi – I’m trying to replicate this, but as twapperkeeper is no longer allowing export, I’m having file issues.

    Any chance you have either (or both): 1. the original twapperkeeper file you manipulated in R and/or 2. the output from R (Graph ML formatted file)?

    I’m trying to replicate this for a project for school -but am running into some snags.

    Thanks very much!

  17. Yusuf O. says:

    Hi,

    Thanks for your contributions. Your posts have helped me get to the bottom of the idea and have saved the day many times. However, I’m having some trouble here. I know the reason behind it, but I don’t know enough about coding to fix it. Since TwapperKeeper no longer offers the Export and Download feature, even if you run it on your own dedicated server, I downloaded the archive in Excel format first and then converted it into .csv format. When I tried to run the .R file above, it returned the following lines:

    starting httpd help server … done
    > source(“/…/tweetgraph.R”)
    Error in file(file, “rt”, encoding = fileEncoding) :
    cannot open the connection
    In addition: Warning messages:
    1: In readLines(file) :
    incomplete final line found on ‘/…/tweetgraph.R’
    2: In file(file, “rt”, encoding = fileEncoding) :
    cannot open file ‘tweets.csv’: No such file or directory
    >

    My take on this error is that I need to revise the codes and make it compatible with the new conditions we face in absence of “Download and Export” feature of TwapperKeeper.

    Can you help with this issue?

    Thanks,

  18. Yusuf O. says:

    Hi again,

    I think the first half of the problem was due to the delimiter issue. I have now made it read the file, but this time it throws this error:

    > source(“/Users/yufikan/Desktop/cnndebate/tweetgraph.R”)
    Error in read.table(file = file, header = header, sep = sep, quote = quote, :
    duplicate ‘row.names’ are not allowed
    In addition: Warning messages:
    1: In readLines(file) :
    incomplete final line found on ‘/Users/yufikan/Desktop/cnndebate/tweetgraph.R’
    2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
    invalid input found on input connection ‘/Users/yufikan/desktop/cnndebate/tweets.csv’

    Do you have any suggestions?

    Appreciate it!

    Yusuf.

  19. [...] – in an R environment (I use RStudio), reuse code from Rescuing Twapperkeeper Archives Before They Vanish and Cornelius Puschmann’s post Generating graphs of retweets and @-messages on Twitter using R and Gephi: [...]

  20. Ben says:

    Thanks for the detailed instructions. Seems like now that twapperkeeper has gone, Martin Hawksey’s (@mhawksey) google sheet system for collecting tweets is the way to go. For those interested in detailed content analysis of tweets using R, I have put together some code for text mining and topic modeling over here: https://github.com/benmarwick/AAA2011-Tweets

  21. Bertil says:

    Is it possible to get just a few lines of the “tweets.csv” file used?
    Really appreciate it!

  22. [...] of a script processed in the ‘R’ programming language found on a blog (http://blog.ynada.com/339); see the attached script named “tweetgraph.R”. It serves to extract from a file [...]
