Plotting texts as graphs with R and igraph

On August 5, 2010, in data, by cornelius

I’ve plotted several word association graphs for this New York Times article (1st paragraph) using R and the igraph library.

#1, random method

text-igraph-random

#2, circle method

text-igraph-circle

#3, sphere method

text-igraph-sphere

#4, spring method

text-igraph-spring

#5, fruchterman-reingold method

text-igraph-fruchterman-reingold

#6, kamada-kawai method

text-igraph-kamada-kawai

#7, graphopt method

text-igraph-graphopt

The red vertices belong to the largest cliques in the graph. Here’s the (rough) R code for plotting such graphs:

rm(list=ls());

library("igraph");
library("Cairo");

# read parameters
print("Text-as-Graph for R 0.1");
print("------------------------------------");

print("Path (no trailing slash): ");
datafolder <- scan(file="", what="char");

print("Text file: ");
datafile <- scan(file="", what="char");

txt <- scan(paste(datafolder, datafile, sep="/"), what="char", sep="\n", encoding="UTF-8");

print("Width/Height (e.g. 1024x768): ");
res <- scan(file="", what="char");
# split "1024x768" once and convert both parts to numbers
dims <- as.numeric(unlist(strsplit(res, "x")));
rwidth <- dims[1];
rheight <- dims[2];

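words <- unlist(strsplit(gsub("[[:punct:]]", " ", tolower(txt)), "[[:space:]]+"));
# drop empty strings left over when a line starts with punctuation or whitespace
words <- words[words != ""];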

g.start <- 1;

g.end <- length(words) - 1;

# one row per pair of adjacent words: column 1 is a word, column 2 its successor
assocs <- matrix(nrow=g.end, ncol=2)

for (i in g.start:g.end)
{
  assocs[i,1] <- words[i];
  assocs[i,2] <- words[i+1];
  print(paste("Pass #", i, " of ", g.end, ". ", "Node word is ", toupper(words[i]), ".", sep=""));
}

print("Build graph from data frame...");
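# graph.data.frame expects a data frame, so convert the matrix explicitly
g.assocs <- graph.data.frame(as.data.frame(assocs, stringsAsFactors=FALSE), directed=FALSE);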

print("Label vertices...");
V(g.assocs)$label <- V(g.assocs)$name;

print("Associate colors...");
V(g.assocs)$color <- "Gray";

print("Find cliques...");
V(g.assocs)[unlist(largest.cliques(g.assocs))]$color <- "Red";

print("Plotting random graph...");
CairoPNG(paste(datafolder, "/", "text-igraph-random.png", sep=""), width=as.numeric(rwidth), height=as.numeric(rheight));
plot(g.assocs, layout=layout.random, vertex.size=4, vertex.label.dist=0);
dev.off();

print("Plotting circle graph...");
CairoPNG(paste(datafolder, "/", "text-igraph-circle.png", sep=""), width=as.numeric(rwidth), height=as.numeric(rheight));
plot(g.assocs, layout=layout.circle, vertex.size=4, vertex.label.dist=0);
dev.off();

print("Plotting sphere graph...");
CairoPNG(paste(datafolder, "/", "text-igraph-sphere.png", sep=""), width=as.numeric(rwidth), height=as.numeric(rheight));
plot(g.assocs, layout=layout.sphere, vertex.size=4, vertex.label.dist=0);
dev.off();

# note: newer igraph releases deprecate layout.spring in favour of layout_with_fr()
print("Plotting spring graph...");
CairoPNG(paste(datafolder, "/", "text-igraph-spring.png", sep=""), width=as.numeric(rwidth), height=as.numeric(rheight));
plot(g.assocs, layout=layout.spring, vertex.size=4, vertex.label.dist=0);
dev.off();

print("Plotting fruchterman-reingold graph...");
CairoPNG(paste(datafolder, "/", "text-igraph-fruchterman-reingold.png", sep=""), width=as.numeric(rwidth), height=as.numeric(rheight));
plot(g.assocs, layout=layout.fruchterman.reingold, vertex.size=4, vertex.label.dist=0);
dev.off();

print("Plotting kamada-kawai graph...");
CairoPNG(paste(datafolder, "/", "text-igraph-kamada-kawai.png", sep=""), width=as.numeric(rwidth), height=as.numeric(rheight));
plot(g.assocs, layout=layout.kamada.kawai, vertex.size=4, vertex.label.dist=0);
dev.off();

#CairoPNG(paste(datafolder, "/", "text-igraph-reingold-tilford.png", sep=""), width=as.numeric(rwidth), height=as.numeric(rheight));
#plot(g.assocs, layout=layout.reingold.tilford, vertex.size=4, vertex.label.dist=0);
#dev.off();

print("Plotting graphopt graph...");
CairoPNG(paste(datafolder, "/", "text-igraph-graphopt.png", sep=""), width=as.numeric(rwidth), height=as.numeric(rheight));
plot(g.assocs, layout=layout.graphopt, vertex.size=4, vertex.label.dist=0);
dev.off();

print("Done!");
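
The pairing loop in the script can also be written without an explicit loop. The following is only a minimal sketch in base R, not part of the original script: the helper name word_pairs is made up for illustration, and it assumes the same tokenization as above (lower-case, punctuation replaced by spaces, split on whitespace).

```r
# Hypothetical helper (not in the original script): turn raw text into a
# two-column character matrix of adjacent-word pairs, using base R only.
word_pairs <- function(txt) {
  # lower-case, replace punctuation with spaces, split on runs of whitespace
  words <- unlist(strsplit(gsub("[[:punct:]]", " ", tolower(txt)), "[[:space:]]+"))
  # drop empty strings left behind by leading punctuation or blanks
  words <- words[nzchar(words)]
  # fewer than two words means there is nothing to pair
  if (length(words) < 2) {
    return(matrix(character(0), ncol = 2))
  }
  # pair each word with its successor: (w1, w2), (w2, w3), ...
  cbind(words[-length(words)], words[-1])
}

pairs <- word_pairs("The cat sat. The cat ran!")
print(pairs)
```

The resulting matrix has the same shape as assocs in the script, so it should be usable in its place when building the graph. Repeated pairs (here "the cat") are kept as separate rows, just as in the loop version.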


I read about this new book series titled Scholarly Communication: Past, present and future of knowledge inscription this morning on the Humanist mailing list. Since scholarly communication is one of my main research interests, I’m thrilled to hear that there will be a series devoted to publications focusing on the topic, edited and reviewed by a long list of renowned scholars in the field.

On the other hand, it’s debatable (see reactions by Michael Nentwich and Toma Tasovac) whether a book series on the future of scholarly communication is not a tad anachronistic, assuming it is published exclusively in print (which seems to be the case, judging by the announcement on the website). New approaches, such as the crowdsourcing angles of Hacking the Academy or Digital Humanities Now, seem more in sync with Internet-age publishing to me, but sadly such efforts usually don’t involve commercial publishers**. My recent struggles with Oxford University Press over a subscription to Literary and Linguistic Computing (the only way of joining the ALLC) have added once more to my skepticism towards commercial publishers. That is not because their goal is to make money (there’s nothing inherently wrong with that), but because they largely refuse to innovate when it comes to their products and business models. Mailing a paper journal to someone who has no use for it is a waste of resources and a sign that you are out of touch with your customers’ needs… at least if your customer is this guy.

Do scholars in the Humanities and Social Sciences* still need printed publications and (consequently) publishers?

Do we need publishers if we decide to go all-out digital?

Do we need Open Access?

I have different stances on these questions depending on the hat I’m wearing. Personally, I think print publishing is stone dead, but I also notice that by and large my colleagues still rely on printed books and journals much more heavily than on digital sources. Regarding the role of publishers and Open Access, the situation is equally complex: we need publishers if our culture of communication doesn’t change, because reproducing digitally what we used to create in print is challenging (see this post for some deliberations). If, on the other hand, we decide that blog posts can replace journal articles because speed and efficiency ultimately win over perfectionism, and because we are no longer producing static objects but a constantly evolving discourse, then the future of commercial publishers looks uncertain. Digital toll-access publishing seems to have little traction in our field so far, though that will probably change with the proliferation of ebooks we are likely to see in the next few years.

Anyhow — what’s your take?

Should we get rid of paper?

Should we get rid of traditional formats and post everything in blogs instead?

Is Cameron Neylon right when he says that the future of research communication is aggregation?

Let me know what you think — perhaps the debate can be a first contribution to Scholarly Communication: Past, present and future. :-)

(*) I believe the situation is fundamentally different in STM, where paper is a thing of the past but publishers are certainly not.

(**) An exception of sorts could be Liquid Pub, but that project seems focused on STM rather than Hum./Soc.Sci.