Post-event Twitter stats for #THATcamp

On May 26, 2010, in data, by cornelius

I thought I’d post an updated version of the simple stats on Twitter activity presented here. The data in the older post was collected before THATcamp took place; the graphs below show the activity during and after the camp.

The tweets I’ve collected are also available here (my own file) and on TwapperKeeper.

Tweets over time (roughly 14th of May to 24th)

Most active users

Most @-messaged users

Most retweeted users


URLs tweeted at #THATCamp (all 230 of them)

On May 24, 2010, in data, by cornelius

I’ve data-mined the #thatcamp hashtag a bit more and extracted all 230 links that were tweeted recently (also includes some of THATCamp Paris). Enjoy :-)

(or go here to view the table inside Google Docs)


Edit: I’ve posted an updated version of the script here. It is not quite as compressed as Anatol’s version, but I think it’s a decent compromise between readability and efficiency. :-)

Edit #2 And yet another update, this one contributed by Kai Heinrich.

I hacked together some code for R last night to visualize a Twitter graph (=who you are following and who is following you) that I briefly showed at the session on visualizing text today at THATCamp and that I wanted to share. My comments in the code are very basic and there is much to improve, but in the spirit of “release early, release often”, I think it’s better to get it out there right away.


Note that packages are most easily installed with the install.packages() function inside of R, so R is really the only thing you need to download initially.


# Load twitteR package

library(twitteR)

# Load igraph package

library(igraph)

# Set up friends and followers as vectors. This, along with some stuff below, is not really necessary, but the result of my relative inability to deal with the twitter user object in an elegant way. I'm hopeful that I will figure out a way of shortening this in the future

friends <- as.character()
followers <- as.character()

# Start a Twitter session. Note that the user through whom the session is started doesn't have to be the one that you search for in the next step. I'm using myself (coffee001) in the code below, but you could authenticate with your username and then search for somebody else.

sess <- initSession('coffee001', 'mypassword')

# Retrieve a maximum of 500 friends for user 'coffee001'.

friends.object <- userFriends('coffee001', n=500, sess)

# Retrieve a maximum of 500 followers for 'coffee001'. Note that retrieving many/all of your followers will create a very busy graph, so if you are experimenting it's better to start with a small number of people (I used 25 for the graph below).

followers.object <- userFollowers('coffee001', n=500, sess)

# This code is necessary at the moment, but only because I don't know how to slice just the "name" field for friends and followers from the list of user objects that twitteR retrieves. I am 100% sure there is an alternative to looping over the objects, I just haven't found it yet. Let me know if you do...

for (i in 1:length(friends.object))
friends <- c(friends, friends.object[[i]]@name);

for (i in 1:length(followers.object))
followers <- c(followers, followers.object[[i]]@name);

# Create data frames that relate friends and followers to the user you search for and merge them.

relations.1 <- data.frame(User='Cornelius', Follower=friends)
relations.2 <- data.frame(User=followers, Follower='Cornelius')
relations <- merge(relations.1, relations.2, all=T)

# Create graph from relations.

g <- graph.data.frame(relations, directed = T)

# Assign labels to the graph (=people's names)

V(g)$label <- V(g)$name

# Plot the graph.

plot(g)


For the screenshot below I've used the tkplot() method instead of plot(), which allows you to move around and highlight elements interactively with the mouse after plotting them. The graph only shows 20 people in order to keep the complexity manageable.
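The two loops in the script can probably be avoided. Here is a minimal sketch, assuming the user objects are S4 objects with a name slot (as the @name access in the script suggests). MockUser and get.names are hypothetical names made up for this example so that it runs without a Twitter session; they are not part of twitteR.

```r
library(methods)

# Hypothetical stand-in for the user objects that twitteR returns:
# an S4 class with a single 'name' slot (an assumption for illustration).
setClass("MockUser", representation(name = "character"))

friends.object <- list(new("MockUser", name = "Alice"),
                       new("MockUser", name = "Bob"))
followers.object <- list(new("MockUser", name = "Bob"),
                         new("MockUser", name = "Carol"))

# Extract the 'name' slot from every object in a list without an explicit loop.
# vapply() also checks that each result is a single character string.
get.names <- function(users) vapply(users, function(u) u@name, character(1))

friends <- get.names(friends.object)
followers <- get.names(followers.object)

# Build the edge list by stacking the two data frames.
relations <- rbind(
  data.frame(User = "Cornelius", Follower = friends, stringsAsFactors = FALSE),
  data.frame(User = followers, Follower = "Cornelius", stringsAsFactors = FALSE)
)

print(relations)
```

The resulting relations data frame has the same two columns as in the script, so it should be usable in the graph-building step in the same way.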


Timely or Timeless? The Scholar’s Dilemma.

On May 19, 2010, in Thoughts, by cornelius

Note: this introduction, co-authored with Dieter Stein, is part of the volume Selected Papers from the Berlin 6 Open Access Conference, which will appear via Düsseldorf University Press as an electronic open access publication in the coming weeks. It is also a response to this blog post by Dan Cohen.

Timely or Timeless? The Scholar’s Dilemma. Thoughts on Open Access and the Social Contract of Publishing

Some things don’t change.

We live in a world seemingly over-saturated with information, yet getting it out there in both an appropriate form and a timely fashion is still challenging. Publishing, although the meaning of the word is undergoing significant change in the time of iPads and Kindles, is still a very complex business. In spite of a much faster, cheaper and simpler distribution process, producing scholarly information that is worth publishing is still hard work and so time-consuming that the pace of traditional academic communication sometimes seems painfully slow in comparison to the blogosphere, Wikipedia and the ever-growing buzz of social networking sites and microblogging services. How idiosyncratic does it seem in the age of cloud computing and the real-time web that this electronic volume is published one and a half years after the event its title points to? Timely is something else, you might say.

Dan Cohen, director of the Center for History and New Media at George Mason University, discusses the question of why academics are so obsessed with formal details and consequently so slow to communicate in a blog post titled “The Social Contract of Scholarly Publishing”. In it, Dan retells the experience of working on a book together with colleague Roy Rosenzweig:

“So, what now?” I said to Roy naively. “Couldn’t we just publish what we have on the web with the click of a button? What value does the gap between this stack and the finished product have? Isn’t it 95% done? What’s the last five percent for?”

We stared at the stack some more.

Roy finally broke the silence, explaining the magic of the last stage of scholarly production between the final draft and the published book: “What happens now is the creation of the social contract between the authors and the readers. We agree to spend considerable time ridding the manuscript of minor errors, and the press spends additional time on other corrections and layout, and readers respond to these signals — a lack of typos, nicely formatted footnotes, a bibliography, specialized fonts, and a high-quality physical presentation — by agreeing to give the book a serious read.”

A social contract between author and reader. Nothing more, nothing less.

It may seem either sympathetic or quaint how Roy Rosenzweig elevates the product of scholarship from a mere piece of more or less monetizable content to something of cultural significance, but he also aptly describes what many academics, especially in the humanities, think of as the essence of their work: creating something timeless. That is, in short, why the humanities are still in love with books, why they retain a pace of publishing that is entirely snail-like, compared both to other academic fields and to the rest of the world. Of course humanities scholars know as well as anyone that nothing is truly timeless and understand that trends and movements shape scholarship just like they shape fashion and music. But there is still a commitment to spend time to deliver something to the reader that is as polished and perfected as one can manage. Something that is not rushed, but refined. Why? Because the reader expects authority from a scholarly work and authority is derived from getting it right to the best of one’s ability.

This is not just a long-winded apology to the readers and contributors to this volume, although an apology for the considerable delay is surely in order, especially taking into account the considerable commitment and patience of our authors (thank you!). Our point is something equally important, something that connects to Roy Rosenzweig’s interpretation of scholarly publishing as a social contract. This publication contains eight papers produced to expand some of the talks held at the Berlin 6 Open Access Conference that took place in November 2008 in Düsseldorf, Germany. While Open Access has successfully moved forward in the past eighteen months and much has been achieved, none of the needs, views and fundamental aspects addressed in this volume — policy frameworks to enable it (Forster, Furlong), economic and organizational structures to make it viable and sustainable (Houghton; Gentil-Beccot, Mele, and Vigen), concrete platforms in different regions (Packer et al) and disciplines (Fritze, Dallmeier-Tiessen and Pfeiffenberger) to serve as models, and finally technical standards to support it (Zier) — none of these things have lost any of their relevance.

Open Access is a timely issue and therefore the discussion about it must be timely as well, but “discussion” in a highly interactive sense is hardly ever what a published volume provides anyway – that is something the blogosphere is already better at. That doesn’t mean that what scholars produce, be it in physics, computer science, law or history, should be hallowed tomes that appear years after the controversies around the issues they cover have all but died down, to exist purely as historical documents. If that happens, scholarship itself has become an obsolete museum piece, because a total lack of urgency will rightly suggest to people outside of universities that a field lacks relevance. If we don’t care when it’s published, how important can it be?

But can’t our publications be both timely and timeless at once? In other words, can we preserve the values cited by Roy Rosenzweig, not out of some antiquated fetish for scholarly works as perfect documents, but simply because thoroughly discussed, well-edited and proofed papers and books (and, for that matter, blog posts) are nicer to read and easier to understand than hastily produced ones? Readers don’t like it when their time is wasted; this is as true as ever in the age of information overload. Scientists are expected to get it right, to provide reliable insight and analysis. Better to be slow than to be wrong. In an attention economy, perfectionism pays a dividend of trust.

How does this relate to Open Access? If we look beyond the laws and policy initiatives and platforms for a moment, it seems exceedingly clear that access is ultimately a solvable issue and that we are fast approaching the point where it will be solved. This shift is unlikely to happen next month or next year, but if it hasn’t taken place a decade from now our potential to do innovative research will be seriously impaired and virtually all stakeholders know this. There is growing political pressure and commercial publishers are increasingly experimenting with products that generate revenue without limiting access. Historically, universities, libraries and publishers came into existence to solve the problem of access to knowledge (intellectual and physical access). This problem is arguably in the process of disappearing, and therefore it is of pivotal importance that all those involved in spreading knowledge work together to develop innovative approaches to digital scholarship, instead of clinging to eroding business models. As hard as it is for us to imagine, society may just find that both intellectual and physical access to knowledge are possible without us and that we’re a solution in search of a problem. The remaining barriers to access will gradually be washed away because of the pressure exerted not by lawmakers, librarians and (some) scholars who care about Open Access, but mainly by a general public that increasingly demands access to the research it finances. Openness is not just a technicality. It is a powerful meme that permeates all of contemporary society.

The ability for information to be openly available creates a pressure for it to be. Timeliness and timelessness are two sides of the same coin. In the competitive future of scholarly communication, those who get everything (mostly) right will succeed. Speedy and open publication of relevant, high quality content that is well adjusted to the medium and not just the reproduction of a paper artifact will trump those publications that do not meet all the requirements. The form and pace possible will be undercut by what is considered normal in individual academic disciplines and the conventions of one field will differ from those of another. Publishing less or at a slower pace is unlikely to be perceived as a fault in the long term, with all of us having long gone past the point of informational over-saturation. The ability to effectively make oneself heard (or read), paired with having something meaningful to say, will (hopefully) be of increasing importance, rather than just a high volume of output.

Much of the remaining resistance to Open Access is simply due to ignorance, and to murky premonitions of a new dark age caused by a loss of print culture. Ultimately, there will be a redefinition of the relativities between digital and print publication. There will be a place for both: the advent of mass literacy did not lead to the disappearance of the spoken word, so the advent of the digital age is unlikely to lead to the disappearance of print culture. Transitory compromises such as delayed Open Access publishing are paving the way to fully-digital scholarship. Different approaches will be developed, and those who adapt quickly to a new pace and new tools will benefit, while those who do not will ultimately fall behind.

The ideological dimension of Open Access – whether knowledge should be free – seems strangely out of step with these developments. It is not unreasonable to assume that in the future, if it’s not accessible, it won’t be considered relevant. The logic of informational scarcity has ceased to make sense and we are still catching up with this fundamental shift.

Openness alone will not be enough. The traditional virtues of a publication – the extra 5% – are likely to remain unchanged in their importance while there is such a thing as institutional scholarship. We thank the authors of this volume for investing the extra 5% for entering a social contract with their readers and another, considerably higher percentage for their immense patience with us. The result may not be entirely timely and, as has been outlined, nothing is ever truly timeless, but we strongly believe that its relevance is undiminished by the time that has passed.

Open Access, whether 2008 or 2010, remains a challenge – not just to lawmakers, librarians and technologists, but to us, to scholars. Some may rise to the challenge while others remain defiant, but ignorance seems exceedingly difficult to maintain. Now is a bad time to bury one’s head in the sand.


May 2010

Cornelius Puschmann and Dieter Stein

Visualizing text: theory and practice

On May 19, 2010, in Thoughts, by cornelius

Note: I’ve also posted this on

Bad, bad me — of course I’ve been putting off writing up my ideas and thoughts for THATcamp almost to the latest possible moment. Waiting so long has one definitive advantage though: I get to point to some of the interesting suggestions that have already been posted here and (hopefully) add to them.

I’d like to both discuss and do text visualization. Charts, maps, infographics and other forms of visualization are becoming increasingly popular as we are faced with large quantities of textual data from a variety of sources. To linguists and literary scholars, visualizing texts can (among other things) be interesting to uncover things about language as such (corpus linguistics) and about individual texts and their authors (narratology, stylometrics, authorship attribution), while to a wide range of other disciplines the things that can be inferred from visualization (social change, spreading of cultural memes) beyond the text itself can be interesting.

What can we potentially visualize? This may seem to be a naive question, but I believe that only by trying out virtually everything we can think of (distribution of letters, words, word classes, n-grams, paragraphs, …; patterning of narrative strands, structure of dialog, occurrence of specific rhetorical devices; references to places, people, points in time…; emotive expressions, abstract verbs, dream sequences… you name it) can we reach conclusions about what (if anything!) these things might mean.
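To make the first few items on that list concrete, here is a minimal sketch in base R that counts words, bigrams and letters in a toy sentence. The sentence and the variable names are made up for illustration, and the whitespace tokenization is deliberately naive; real texts would need proper handling of punctuation and encoding.

```r
# Toy text (an invented example); any character string would do.
text <- "the cat sat on the mat and the dog sat on the log"

# Naive tokenization: lowercase, then split on whitespace.
words <- strsplit(tolower(text), "\\s+")[[1]]

# Word frequencies, most frequent first.
word.freq <- sort(table(words), decreasing = TRUE)

# Bigrams: pair each word with the word that follows it.
bigrams <- paste(head(words, -1), tail(words, -1))
bigram.freq <- sort(table(bigrams), decreasing = TRUE)

# Letter frequencies, ignoring everything that is not a-z.
letters.only <- strsplit(gsub("[^a-z]", "", tolower(text)), "")[[1]]
letter.freq <- table(letters.only)

print(head(word.freq, 3))
print(head(bigram.freq, 3))
```

Each of these tables can be handed to barplot() for a first look, although, as argued below, bar plots only cover the quantitative side of a text.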

How can we visualize text? If we consider for a moment how we mostly visualize text today it quickly becomes apparent that there is much more we could be doing. Bar plots, line graphs and pie charts are largely instruments for quantification, yet very often quantitative relations between elements aren’t our only concern when studying text. Word clouds add plasticity, yet they eliminate the sequential patterning of a text and thus do not represent its rhetorical development from beginning to end. Trees and maps are interesting in this regard, but by and large we hardly utilize the full potential of visualization as a form of analysis, for example by using lines, shapes, color (!) and beyond that, movement (video) in a way that suits the kind of data we are dealing with.

What tools can we use to do visualization? I’m very interested in Processing and have played with it, and more extensively with R and NLTK/Python. Tools for rendering data, such as Google Chart Tools, igraph and RGraph are also interesting. Other, non-statistical tools are also an option: free hand drawing tools and web-based services like Many Eyes. Visualization doesn’t need to be restricted to computation/statistics. Stephanie Posavec’s trees are a dynamic mix of automation and manual annotation and demonstrate that visualizations are rhetorically powerful interpretations themselves.

I hope that some of the abovementioned things connect to other THATcampers’ ideas, e.g. Lincoln Mullen’s post on mining scarce sources and Bill Ferster’s post on teaching using visualization.

Don’t get me started on the potential for teaching. Ultimately translating a text into another form is a unique kind of critical engagement: you’re uncovering, interpreting and making an argument all at once, both to the text in question and to yourself.

Anyway — anything from discussing theoretical issues of visualization to sharing code snippets would fit into this session and I’m looking forward to hearing other campers’ thoughts and experiences on the subject.


Edit: this post on (legal aspects of) data sharing by Creative Commons’ Kaitlin Thaney is also highly recommended.

Edit #2: This is another cross post with

If you’re involved in academic publishing — whether as a researcher, librarian or publisher — data sharing and data publishing are probably hot issues to you. Beyond its versatility as a platform for the dissemination of articles and ebooks, the Internet is increasingly also a place where research data lives. Scholars are no longer restricted to referring to data in their publications or including charts and graphs alongside the text, but can link directly to data published and stored elsewhere, or even embed data into their papers, a process facilitated by standards such as the Resource Description Framework (RDF).

Journals such as Earth System Science Data and the International Journal of Robotics Research give us a glimpse at how this approach might evolve in the future: from journals to data journals, publications which are concerned with presenting valuable data for reuse and pave the way for a research process that is increasingly collaborative. Technology is gradually catching up with the need for genuinely digital publications, a need fueled by the advantages of being able to combine text, images, links, videos and a wide variety of datasets to produce a next-generation multi-modal scholarly article. Systems such as Fedora and PubMan are meant to facilitate digital publishing and assure best-practice data provenance and storage. They are able to handle different types of data and associate any number of individual files with a “data paper” that documents them.

However, technology is the much smaller issue when weighing the advantages of data publishing against its challenges, of which there are many, both to practitioners and to those supporting them. Best practices on the individual level are cultural norms that need to be established over time. Scientists still don’t have sufficient incentives to openly share their data, as tenure processes are tied to publishing results based on data, but not to sharing data directly. And finally, technology is prone to failure when there are no agreed-upon standards guiding its use, and such standards need to be gradually (meaning painfully slowly, compared with technology’s breakneck pace) established and accepted by scholars, not decreed by committee.

In March, Jonathan Rees of NeuroCommons (a project within Creative Commons/Science Commons) published a working paper that outlines such standards for reusable scholarly data. One thing I really appreciate about Rees’ approach is that it is remarkably discipline-independent and not limited to the sciences (vs. social science and the humanities).

Rees outlines how data papers differ from traditional papers:

A data paper is a publication whose primary purpose is to expose and describe data, as opposed to analyze and draw conclusions from it. The data paper enables a division of labor in which those possessing the resources and skills can perform the experiments and observations needed to collect potentially interesting data sets, so that many parties, each with a unique background and ability to analyze the data, may make use of it as they see fit.

The key phrase here (which is why I couldn’t resist boldfacing it) is division of labor. Right now, to use an auto manufacturing analogy, a scholar does not just design a beautiful car (an analysis in the form of a research paper that culminates in observations or theoretical insights), she also has to build an engine (the data that her observations are based on). It doesn’t matter if she is a much better engineer than designer, the car will only run (she’ll only get tenure) if both the engine and the car meet the same requirements. The car analogy isn’t terribly fitting, but it serves to make the point that our current system lacks a division of labor, making it pretty inefficient. It’s based more on the idea of producing smart people than on the idea of getting smart people to produce reusable research.

Rees notes that data publishing is a complicated process and lists a set of rules for successful sharing of scientific data.

From the paper:

  1. The author must be professionally motivated to publish the data
  2. The effort and economic burden of publication must be acceptable
  3. The data must become accessible to potential users
  4. The data must remain accessible over time
  5. The data must be discoverable by potential users
  6. The user’s use of the data must be permitted
  7. The user must be able to understand what was measured and how (materials and methods)
  8. The user must be able to understand all computations that were applied and their inputs
  9. The user must be able to apply standard tools to all file formats

At a glance, these rules signify very different things. #1 and #2 are preconditions rather than prescriptions, while #3 – #6 are concerned with what the author needs to do in order to make the data available. Finally, rules #7 – #9 are concerned with making the data as useful to others as possible. Rules #7 – #9 are dependent on who “the user” is and qualify as “do-this-as-best-as-you-can”-style suggestions rather than strict requirements, not because they aren’t important, but because it’s impossible for the author to guarantee their successful implementation. By contrast, #3 – #6 are concerned with providing and preserving access and are requirements: I can’t guarantee that you’ll understand (or agree with) my electronic dictionary on Halh Mongolian, but I can make sure it’s stored in an institutional or disciplinary repository that is indexed in search engines, mirrored to assure the data can’t be lost and licensed in a legally unambiguous way, rather than upload it to my personal website and hope for the best when it comes to long-term availability, ease of discovery and legal re-use.

Finally, Rees gives some good advice beyond tech issues to publishers who want to implement data publishing:

Set a standard. There won’t be investment in data set reusability unless granting agencies and tenure review boards see it as a legitimate activity. A journal that shows itself credible in the role of enabling reuse will be rewarded with submissions and citations, and will in turn reward authors by helping them obtain recognition for their service to the research community.

This is critical. Don’t wait for universities, grant agencies or even scholars to agree on standards entirely on their own — they can’t and won’t if they don’t know how digital publishing works (legal aspects included). Start an innovative journal and set a standard yourself by being successful.

Encourage use of standard file formats, schemas, and ontologies. It is impossible to know what file formats will be around in ten years, much less a hundred, and this problem worries digital archivists. Open standards such as XML, RDF/XML, and PNG should be encouraged. Plain text is generally transparent but risky due to character encoding ambiguity. File formats that are obviously new or exotic, that lack readily available documentation, or that do not have non-proprietary parsers should not be accepted. Ontologies and schemas should enjoy community acceptance.

An important suggestion that is entirely compatible with linguistic data (dictionaries, word lists, corpora, transcripts, etc.) and simplified by the fact that we have comparatively small datasets. Even a megaword corpus is small compared to climate data or gene banks.

Aggressively implement a clean separation of concerns. To encourage submissions and reduce the burden on authors and publishers, avoid the imposition of criteria not related to data reuse. These include importance (this will not be known until after others work with the data) and statistical strength (new methods and/or meta-analysis may provide it). The primary peer review criterion should be adequacy of experimental and computational methods description in the service of reuse.

This will be a tough nut to crack, because it sheds tradition to a degree. Relevance was always high on the list of requirements while publications were scarce: paper costs money, therefore what was published had to be important to as many people as possible. With data publishing this is no longer true: whether something is important or statistically strong (applying this to linguistics one might say representative, well-documented, etc.) is impossible to know from the outset. It’s much more sensible to get it out there and deal with the analysis later, rather than creating an artificial scarcity of data. But it will take time and cultural change to get researchers (and both funding agencies and hiring committees) to adapt to this approach.

In the meantime, while we’re still publishing traditional (non-data) papers, we can at least work on making them more accessible. Something like arXiv for linguistics wouldn’t hurt.