Small notes on Big Data from Oxford

On March 25, 2013, in Events, by cornelius

As it was snowing cats and dogs in beautiful Oxford this weekend, I figured I might as well get to a much-neglected task: blogging. Following the excellent workshop Big Data: Rewards and Risks for the Social Sciences here at the Oxford Internet Institute last week (to be followed by another event this week), it feels especially timely to write up a short summary of what I've been doing during my stay here, and in the last couple of months in general.

Taken in January, but two months later the weather in this part of England hasn’t changed much.

It's been a fantastic time at the OII so far, and I have to admit that if I could, I would love to stay for a longer period. I arrived in January to snow, and after figuring out the ins and outs of Oxford life (how to handle various keycards, where to shop and get coffee, which pubs are best), I quickly settled into my new office at 66 Banbury Road, the institute's northern outpost. Office space is in short supply in Oxford for obvious reasons (one can't exactly tear down a medieval college...), but I liked 66 right away, simply because everyone I've encountered here has been incredibly welcoming and friendly, making it easy to settle in and get a lot of work done. I've had the chance to chat with a fantastic variety of people in the office, during brownbags, workshops, and over lunch, and it's an absolutely unique environment.

So what's been happening on my end? A few days before my arrival in Oxford, my awesome colleague Jean Burgess and I published our working paper The Politics of Twitter Data in the Humboldt Institute for Internet and Society's SSRN Discussion Paper Series. The piece has been quite well received, with coverage from Patrick Maier in the NatGeo Explorer blog, as well as from Netzpolitik. Around the time of my arrival in the UK, the volume Pragmatics of CMC, which had been in the making for quite a few years, was published by De Gruyter. I contributed a chapter to the handbook, expertly edited by Susan C. Herring, Dieter Stein, and Tuija Virtanen, which distills a lot of my previous research on blogging.

A few weeks into my stay, I was invited to give a talk in the Nuffield Network Seminar Series (see slides below). My presentation focused on the scientific blogging platform Hypotheses.org, providing an analysis of the dynamics of knowledge exchange between different scholarly communities inside the platform. Of particular interest to me are disciplinary and linguistic communities, a theme that Marco Toledo Bastos, Rodrigo Travitzki and I have also explored in a recent paper on Twitter activism, about which I'll post more soon (Marco will present this research in Paris at Hypertext 2013 next month). I'm very keen on doing more (and especially more sophisticated) network analysis, and feedback from Bernie Hogan, Sandra Gonzalez-Bailon, and Taha Yasseri has already been invaluable in this regard. I'll also be delivering an invited talk at the annual meeting of the Berliner Arbeitskreis Information next month on the role of Big Data for knowledge production, in which I will combine insights from research with a dose of criticism and reflection.

Later this spring, I am exceptionally looking forward to ICA 2013 in London, where I will be presenting two papers: one as part of the panel Big Data and Communication Research: Prospects, Perils, Alliances, and Impacts, chaired by Eric T. Meyer, and another (oddly enough) in a session on Copyright and Digital Piracy with my colleague Merja Mahrt (who, by the way, has also written this important piece on Big Data for communications research). The first panel will see contributions from Eric, Ralph Schroeder, Bernie Hogan and Mark Graham, Matthew Weber, and danah boyd and Kate Crawford, all of whom are exceptional researchers. My talk will focus on the politics (and economics) of social media platforms as characterized by the relationship between platform providers, data resellers, large media companies and consumers.

As you can probably guess, all of this ties in beautifully with the ongoing activities relating to Big Data here at the OII and future research at the Humboldt Institute for Internet and Society, where Big Data is also a major topic. The workshop last week was part of an initiative funded by the Sloan Foundation to promote discussion about Big Data in the Social Sciences. More discussion is needed about the impact of Big Data on scholarly research, but also on politics, business and culture more broadly.

It seems that 2013 is shaping up to be the year in which academia catches up with Big Data — or at least with some of the hype surrounding it.

I’ve already shared this bit of personal news with a few friends and colleagues, but I thought I’d blog about it as well — especially since I’m woefully behind on my Iron Blogger schedule. ;-)

After a fairly long time in the making, I have been awarded a three-year research grant from the Deutsche Forschungsgemeinschaft (DFG) for the project Networking, visibility, information: a study of digital genres of scholarly communication and the motives of their users (summary in German on the DFG's site). The project investigates new forms of scholarly communication (especially blogging and Twitter) and their role in academia. My key concerns are usage motives, i.e. why scholars use blogs and Twitter, and how these motives correspond with usage practices (how they blog and tweet), rather than how many researchers use these channels of communication or what makes them refrain from using them (see this blog post and the study mentioned in it for that kind of work). My main methods will be qualitative interviews with a sample of 20-25 blogging and/or tweeting academics, along with in-depth content analysis of the material they post in these channels over a prolonged period (>1 year). Identifying usage patterns and relating them to the participants' narratives about their use will be another key objective. Ultimately, I hope to find a (tentative) answer to the question of what role blogs and Twitter may play for the future of digital scholarship, and whether they will remain a niche phenomenon or become mainstream over time.

The project follows up on my work on corporate blogging and connects strongly to what we have been doing at the Junior Researchers Group “Science and the Internet” over the past year, but the focus on interviews should result in a more user-centric analysis. As someone who has been doing (applied) linguistic analysis to make inferences about social processes, I feel much more comfortable actually talking to the people I want to study, rather than just crunching numbers on how they tweet. Big data social science research is obviously and understandably en vogue these days, but I hope to find a good synergy between qualitative and quantitative approaches in my project.

My new institutional home for the next three years will be the Berlin School of Library and Information Science at Humboldt University. I’m grateful to Michael Seadle for supporting my project and really look forward to working with my new colleagues at IBI (that’s the German acronym, which, as far as I can tell, is preferred to its more entertaining English equivalent). I also look forward to working with colleagues from the Alexander von Humboldt Institute for Internet and Society (HIIG) where I’m currently supporting the project Regulation Watch. Finally, I plan to keep in close contact with the colleagues in Düsseldorf, both at the Junior Researchers Group and the Department of English Language and Linguistics, where I have learned virtually everything I know about being a researcher. I am especially indebted to Dieter Stein for his enduring support and for his contagious enthusiasm for all aspects of scholarship.

Sic itur ad astra! :-)

For an overview of previous work I’ve done in this direction, have a look at my publications.


Those of you following my occasional updates here know that I have previously posted code for graphing Twitter friend/follower networks using R (post #1, post #2). Kai Heinrich was kind enough to send me some updated code that uses a newer version of the extremely useful twitteR package. His very crisp, yet thoroughly documented script is pasted below.

# Script for graphing Twitter friends/followers
# by Kai Heinrich (kai.heinrich@mailbox.tu-dresden.de) 
 
# load the required packages
 
library("twitteR")
library("igraph")
 
# HINT: In order for the tkplot() function to work on mac you need to install 
#       the TCL/TK build for X11 
#       (get it here: http://cran.us.r-project.org/bin/macosx/tools/)
#
# Get user information with the twitteR function getUser();
#  instead of your own username you can do this with any other username as well
 
start<-getUser("YOUR_USERNAME") 
 
# Get friend and follower names by first fetching the IDs (getFriendIDs(), getFollowerIDs())
# and then looking up the names (lookupUsers())
 
friends.object<-lookupUsers(start$getFriendIDs())
follower.object<-lookupUsers(start$getFollowerIDs())
 
# Retrieve the names of your friends and followers from the friend
# and follower objects. You can limit the number of friends and followers by adjusting the 
# size of the selected data with [1:n], where n is the number of followers/friends 
# that you want to visualize. If you do not put in the expression, the maximum number of
# friends and/or followers will be visualized.
 
n<-20 
friends <- sapply(friends.object[1:n], name)
followers <- sapply(follower.object[1:n], name)
 
# Create a data frame that relates friends and followers to you for expression in the graph
relations <- merge(data.frame(User='YOUR_USERNAME', Follower=friends),
                   data.frame(User=followers, Follower='YOUR_USERNAME'), all=T)
 
# Create graph from relations.
g <- graph.data.frame(relations, directed = T)
 
# Assign labels to the graph (=people's names)
V(g)$label <- V(g)$name
 
# Plot the graph using plot() or tkplot(). Remember the HINT at the
# beginning if you are using Mac OS X.
tkplot(g)

Ahead of publishing my TwitterFunctions library of R code (which is a constant work in progress), I thought I'd put up some really short Python code for getting a person's friends and followers. Both scripts rely on Tweepy, my favorite Python wrapper for the Twitter API. Install Python (it works on Windows as well, not just on Mac/Linux), then Tweepy on top of that, and you are good to go with these two scripts, which can be executed from the command line like so:
python get_friends.py username

get_friends.py:
import sys
import tweepy
 
# the username is passed as the first command-line argument
user = sys.argv[1]

# print the screen names of all accounts the user follows
for friend in tweepy.api.friends(user):
	print friend.screen_name
get_followers.py (identical except for the API call):
import sys
import tweepy
 
# the username is passed as the first command-line argument
user = sys.argv[1]

# print the screen names of all accounts following the user
for follower in tweepy.api.followers(user):
	print follower.screen_name
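
Since the two scripts differ only in the API call, you could also merge them and select friends or followers with a second command-line argument. A minimal sketch (the name get_connections.py is hypothetical, and it assumes the same tweepy version as the scripts above):

import sys
import tweepy

# usage: python get_connections.py username [friends|followers]
user = sys.argv[1]
mode = sys.argv[2] if len(sys.argv) > 2 else 'friends'

# pick the matching API call
fetch = tweepy.api.friends if mode == 'friends' else tweepy.api.followers
for account in fetch(user):
	print account.screen_name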

Unfortunately I'm not able to attend the annual IPrA conference next week in Manchester and had to cancel the trip at short notice. I was scheduled to give a talk, as part of the session Quoting in Computer-mediated Communication, on my work with Katrin Weller on retweeting among scientists.

Luckily for me, there will be a follow-up event of sorts (see below). I’ve posted the call here since it doesn’t seem to be available on the Web other than as a PDF. Submit something if you’re doing research on quoting! I’m fairly sure that the deadline will be extended by a week or two.

CfP: Quoting Now and Then – 3rd International Conference on Quotation and Meaning (ICQM)

University of Augsburg, Germany

19 April – 21 April 2012

Conference Convenors:
Wolfram Bublitz
Jenny Arendholz
Christian Hoffmann
Monika Kirner

Contact: Monika Kirner
E-mail: monika.kirner@phil.uni-augsburg.de

Call for Papers
This conference addresses the pragmatics of quoting as a metacommunicative act both in old (printed) and new (electronically mediated) communication. With the rapid evolution of new media in the last two decades, approaches to the study of (forms, functions and impact of) quoting have been gaining momentum in linguistics. Although quotations in print media have already been investigated to some extent, quoting in computer-mediated communication is still uncharted territory. This conference shall focus on the formal and functional evolution of quoting from old (analog) to new (digital) media. While the conference builds on the panel “Quoting in Computer-mediated Communication” to be presented in July 2011 at the International Conference of Pragmatics (IPrA), it assumes a much broader perspective, paying special tribute to the inherent confluence and complementarity of synchronic and diachronic approaches. Consequently, we invite papers from both (synchronic and diachronic) perspectives to report on the formal, functional as well as the pragmatic-discursive and multimodal nature of quoting in different genres or media.

Plenary talk: Jörg Meibauer

Abstracts:
Please submit an abstract of not more than 500 words (for a 30 min talk plus 10 min discussion) via e-mail to monika.kirner@phil.uni-augsburg.de

Deadline for abstracts:
1 July 2011 (extended to 15 August 2011)


I meant to post this a month or so ago, when I was conducting my study of casual tweeting, but didn’t get to it. No harm in posting it now, I guess — code doesn’t go bad, fortunately.

Note: this requires Linux/Unix/OSX, Python 2.6 and the tweepy library. It might also work on Windows, but I haven’t checked.

1. Fetching a single user’s tweets with twitter_fetch.py

The purpose of the script below is to automatically retrieve all new tweets by one or more users, where “new” means all tweets that have been added since the last round of archiving. If the script is called for the first time for a given user, it will try to retrieve all available tweets for that person. It relies on the tweepy package for Python, which is one of a number of libraries providing access to the Twitter API. In case you’re looking for a library for R, check out twitteR.

import sys
import time
import os
import tweepy
 
# make sure that the directory 'Tweets' exists, this is
# where the tweets will be archived
wdir = 'Tweets'
user = sys.argv[1]
id_file = user + '.last_id'
timeline_file = user + '.timeline'
 
if os.path.exists(wdir + '/' + id_file):
	f = open(wdir + '/' + id_file, 'r')
	since = int(f.read())
	f.close()
	tweets = tweepy.api.user_timeline(user, since_id=since)
else:
	tweets = tweepy.api.user_timeline(user)
 
if len(tweets) > 0:
	last_id = str(tweets[0].id)
	tweets.reverse()
 
	# write tweets to file
	f = open(wdir + '/' + timeline_file, 'a+')
	for tweet in tweets:
		# strip line breaks so each tweet stays on a single line of the archive
		output = str(tweet.created_at) + '\t' + tweet.text.replace('\r', ' ').replace('\n', ' ').encode('utf-8') + '\t' + tweet.source.encode('utf-8') + '\n'
		f.write(output)
		print output
	f.close()
 
	# write last id to file
	f = open(wdir + '/' + id_file, 'w')
	f.write(last_id)
	f.close()
else:
	print 'No new tweets for ' + user

The code is pretty straightforward. I wrote it without really knowing Python beyond the bare essentials, relying heavily on IPython's code completion. The actual retrieval of tweets happens in a single line:

tweets = tweepy.api.user_timeline(user)
 

The rest of the script is devoted to managing the data and making sure only new tweets are retrieved. This is done via the since_id parameter which is fed the last recorded id that has been saved to the user’s id file in the previous round of archiving. There are more elegant ways of doing this, but any improvements are up to you. ;-)
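
One such improvement concerns pagination: as far as I can tell, user_timeline() only returns a single page of tweets per call, so the first run for a prolific user may not retrieve their full history. tweepy ships a Cursor object that iterates over all pages. A rough sketch (untested; whether the id keyword and the paging behavior work exactly like this depends on your tweepy version):

import tweepy

def fetch_all(user, since=None):
	# let Cursor handle pagination; only pass since_id when we have one
	kwargs = {'since_id': since} if since else {}
	return [tweet for tweet in tweepy.Cursor(tweepy.api.user_timeline, id=user, **kwargs).items()]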

2. Fetching a bunch of different users’ tweets with twitter_fetch_all.sh

Second comes a very simple bash script. The only thing it does is call twitter_fetch.py once for each user in a list of people you want to track. Again, there are probably other ways of doing this, but I wanted to keep the functions of the two different scripts separate.

#!/bin/bash
# This will run twitter_fetch.py once for each entry in the twitter_users[] array.
# Add any number of twitter_users[NUMBER]="USER" lines below to archive additional accounts.
 
# --- twitter user list ---
twitter_users[0]="SomeUser"
twitter_users[1]="SomeOtherUser"
twitter_users[2]="YetAnotherUser"
twitter_users[3]="YouGetTheIdea"
 
# --- execute twitter_fetch.py ---
for twitter_user in ${twitter_users[*]}
do
	echo "Getting tweets for user $twitter_user"
	python twitter_fetch.py $twitter_user
	echo "Done."
	echo ""
done

You should place this in the same directory as twitter_fetch.py and modify it to suit your needs.
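
If you'd rather not hard-code the list of accounts, an alternative is to keep one username per line in a plain text file and loop over that. A quick sketch in Python (the file name users.txt is just an example):

import subprocess

# read one Twitter username per line from users.txt and archive each account
with open('users.txt') as f:
	for line in f:
		user = line.strip()
		if user:
			print 'Getting tweets for user ' + user
			subprocess.call(['python', 'twitter_fetch.py', user])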

3. Automating the whole thing with a cronjob

Finally, here's a cron directive I used to automate the process and log the result in case any errors occur. Read the linked Wikipedia article if you're unfamiliar with cron; it's a very convenient way of automating tasks on Linux/Unix.

0 * * * * sh /root/twitter_fetch_all.sh >/root/twitter_fetch.log
 

(Yes, I’m running this as root. Because I can. And because it’s an EC2 instance with nothing else on it anyway.)
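
One caveat: a single > overwrites the log on every run and doesn't capture error messages, which go to stderr. If you want a cumulative log that includes errors, standard shell redirection does the trick:

0 * * * * sh /root/twitter_fetch_all.sh >>/root/twitter_fetch.log 2>&1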

Hope it’s useful to someone, let me know if you have any questions. :-)


A crisis of brains indeed

On May 30, 2011, in Thoughts, by cornelius

I want to take a moment to comment on NYT editor Bill Keller's op-ed The Twitter Trap, which I read this morning over coffee.

Keller's piece is one of those self-proclaimed "thoughtful critiques" of digital media that journalists write to prove how relevant they still are. I don't know about you, but I'm starting to feel like I've read hundreds of these "I'm not a Luddite, but..." articles about the pros and cons of the Net and its "impact on our culture" by now. Have a look and see if you recognize the script.

The line of argumentation is sadly predictable: first we are assured of how technologically progressive and up to date the author is, and then Twitter, Facebook et al. are associated with information overload, dropping attention spans, and changes to our brains, souls, culture, and overall well-being. Sprinkled in are references and analogies that make the changes afforded by digital technology seem awe-inspiring and unique, oddly supporting the author's argument about their world-changing potential. Comparing Mark Zuckerberg to Johannes Gutenberg (as Keller does) may look catchy at first glance, but it feels wrong for so many reasons I don't know where to begin. For one thing, Gutenberg's invention didn't make him quite as rich as Goldman Sachs made Zuckerberg, and it was conceived, you know, to spread the Word of God rather than to monopolize social networking. People like Vint Cerf and Tim Berners-Lee (to name just two examples) were a lot more like Gutenberg in the sense that they enabled a media shift they didn't anticipate. Had Gutenberg been like Zuckerberg, we'd all be using the exact same printing press, and Johannes would be able to make sure we're not using it to print anything nasty.

Forget Twitter and Facebook, this is the real threat to our brains.

Keller uses a fairly canonical set of arguments for his critique. He starts by claiming that people were more adept at memorizing large amounts of information in the pre-Gutenberg era. Alright, agreed, but wouldn't this have to be followed up with a long list of other things that were also different back then? Mass-media fantasies aside, I think we have one hell of a hard time relating to the medieval mindset, and oral culture (which is by no means exclusively medieval) is the smallest reason why.

Have our culture and brains really been under siege since the invention of cuneiform or, alternately, movable type? That's one hell of a downward spiral, Bill. One oddly decontextualized neuroscience soundbite and several personal anecdotes later, Keller closes with nothing less than a plea for young, confused souls:

My own anxiety is less about the cerebrum than about the soul, and is best summed up not by a neuroscientist but by a novelist. In Meg Wolitzer’s charming new tale, “The Uncoupling,” there is a wistful passage about the high-school cohort my daughter is about to join. Wolitzer describes them this way: “The generation that had information, but no context. Butter, but no bread. Craving, but no longing.”

No longing, Bill? Like, seriously?

(I'll skip the part where nobody, you know, knowledgeable is consulted on the topic, and an argument is instead stitched together from a (seemingly unrelated) neuroscience experiment and a novel. Just know that I'll be watching for a similar choice of sources in a NYT op-ed when the next energy crisis or financial crisis looms.)

If you ask me, there's plenty of longing alright, but it's not the longing, craving or whatever of these poor young people for "context", but rather the longing of a newspaper editor for coherence, authority and control. It's a crisis of brain and soul for journalists and other power elites (scholars, teachers, politicians, parents) who find themselves challenged by what the kids are doing. The example Keller gives about asking a complex question on Twitter and getting a short, reductive answer is telling in several ways. It's not just that the expectations are wrong, it's that his way of using Twitter is characterized as the way of using Twitter. Yet a prominent journalist's use of microblogging bears little similarity to what the kids Keller fears for are doing. Claiming that they live in a world with "no context" (ah, the idiocy of that expression!) is equally demeaning and implausible. It's just that it's a context that both journalists and parents have difficulty understanding.

What cheeses me off about all of this is that we could be having a real debate about the implications of digital technology, instead of playing out this tired, old are-you-for-or-against-the-Internet trope, which by now feels extremely dated. Keller's criticism looks oddly similar to that of Frank Schirrmacher, another journalist and editor of the German newspaper Frankfurter Allgemeine Zeitung, who described the Internet as "a threat to our brains" in a book he published in 2009. Funny just how many "credible digital Cassandras" (Keller) decide to spread their warnings by means of well-publicized books. They speak at conferences, peddle their "criticism" on TV shows and write thoughtful newspaper op-eds reminding us of a simpler time when information was scarce and (comparably) easy to control and monetize. They remind us, the silly, star-eyed public, that not everything related to the Internet is teh awesome, but that some things there are smutty, bad and dangerous, and that we must watch out before our children succumb to technology's evil influences, which is best achieved by reading their thoughtful, balanced-yet-critical books. Except that these are mostly tired enumerations of speculations, soundbites from neurologists about as closely related to Facebook's effect on society as to the effect of brain-eating zombies on our mental health, and uninformed, extremely self-referential deliberations on how the Internet is scary to elites, based on how they themselves are using it. They exaggerate the impact of digital technology and its uniqueness, because if you're in the horse-and-buggy industry, nothing is scarier than the automobile. And finally, they assume that everyone uses Twitter, Facebook and other services in the same way, which, as it turns out, is not true.

There is, of course, a lot that can go wrong in the future. Data is increasingly treated as capital, and those who produce it aren't the ones who own the inferences mined from it about attitudes, behaviors and consumer choices. Despite the clamor that through the Internet information is available to EVERYONE, it's neither true that everyone has access nor that we're all using the same Internet. Censorship, privacy, I could go on. But why bother with these complicated issues if you, as the executive editor of a leading newspaper, can instead lament how terribly conflicted you are about all this change that's going on? Why acknowledge that the situation is complex and has many facets when you can instead troll a bit and get a lot of "how dare you"s and "finally someone says it"s in response? Funny how even a piece about the dangers of Twitter has to be, you know, debatable on Twitter.

Come to think of it, this would make a stellar title for a book:

OMG! ZOMBIES! How the hysteria of our social elites about the Internet is keeping us from engaging in a serious discussion.


Dear Twitter user,

I am a linguist at the University of Düsseldorf whose main research focus is Internet communication. As part of the study "Aspekte privater Twitter-Kommunikation" (Aspects of Private Twitter Communication), I want to examine the usage habits of German-speaking Twitter users who do not use Twitter exclusively for professional purposes (in contrast to, for example, journalists, scientists, politicians, and other people in communication professions). To this end, I would like to record and analyze your public tweets over the course of one month. Afterwards, I would like to ask you a few questions (no more than 10) about your Twitter use via e-mail.

Only public tweets (i.e. no DMs) will be recorded. All data will be anonymized (i.e. names, including Twitter nicknames, will be removed) and will not be passed on to third parties. Individual tweets can be excluded from the recording at any time via the hashtag #exclude. At the end of the study period, I will gladly send you an archive of your recorded tweets if you are interested.

Besides contributing to scientific research, a (small) reward beckons: at the end of the study period, I will raffle off an Amazon gift certificate worth 50 euros among the participants. :-)

If you are willing to participate, please send a short e-mail to Cornelius.Puschmann@uni-duesseldorf.de (edit: you can of course also get in touch via Twitter). If you would rather not participate, you don't need to do anything. I am happy to answer any questions about the study via e-mail.

Thank you in advance for your interest and support!

Dr. Cornelius Puschmann
Junior Researchers Group "Science and the Internet" (Nachwuchsforschergruppe "Wissenschaft und Internet")
Heinrich-Heine-Universität Düsseldorf


As part of the research we’re doing in Düsseldorf on the use of Twitter at academic conferences, here’s a poster we’re presenting in a few days at GOR ’11:

Here’s the citation for the poster:

Puschmann, C., Weller, K., & Dröge, E. (2011). Studying Twitter conversations as (dynamic) graphs: visualization and structural comparison. Presented at General Online Research, 14-16 March 2011, Düsseldorf, Germany. Retrieved from http://ynada.com/posters/gor11.pdf.

See this older post for more information on how to visualize dynamic graphs of retweets with Gephi.


I thought I'd write a brief update to this earlier post discussing the consequences of Twitter's recent TOS update and its enforcement of the redistribution clause. Here is a concise summary from ReadWriteWeb:

[..] Twitter’s recent announcement that it was no longer granting whitelisting requests and that it would no longer allow redistribution of content will have huge consequences on scholars’ ability to conduct their research, as they will no longer have the ability to collect or export datasets for analysis.

Read this earlier RWW post for more background. Twitter has cracked down on services like TwapperKeeper and 140kit.com that allow users not only to track Twitter keywords and hashtags, but also to export and download archives of tweets in XML or CSV. Apparently Twitter wants to stop redistribution of “its” content to the extent possible, including redistribution for research purposes. From the RWW post:

140kit offered its Twitter datasets to other scholars for their own research. By no means a full or complete scraping of Twitter data, this information that the project had collected was still made available for download (for free) to researchers. But no longer.

The people at 140kit, to their credit, are working on an approach that would allow researchers to work with Twitter data not by exporting it, but by using their interface. From 140kit's website:

We have a solution, which will involve using a plugin based analytical approach, which will not allow you to export data, but will, with Twitter’s blessings, allow you to ask any questions to your dataset with ease.

Hmm, sorry, but I'm underwhelmed. There are already countless services out there that allow Twitter analysis in some form, often with nebulous results, because data collection and methods are not transparent. With any list of frequent terms on Twitter, the questions need to be: What stop words did you exclude? How clean is your data? I can't know whether these things are done appropriately for my analysis unless I do them myself. You might object that not everyone is keen on sifting through CSV files with their own scripts. That's true outside of academic research — for a casual analysis, using a GUI tool might be okay — but for serious analysis, direct access to the raw data itself is a must. And beyond just having access yourself, in the spirit of reproducible research it's important to distribute the dataset along with your paper. That's where we should be heading, rather than basing our analyses on pre-produced tools and mechanisms that handle the data in ways that are opaque and beyond our control.
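
To make the stop word point concrete, here's a toy illustration in plain Python (the two tweets are made up): edit the stop list and watch the "top terms" change completely.

# two invented tweets; the ranking below depends entirely on the stop list
stopwords = set(['rt', 'the', 'a', 'and', 'of', 'to'])
tweets = ['RT the quick brown fox', 'the lazy dog and the quick fox']

counts = {}
for t in tweets:
	for w in t.lower().split():
		if w not in stopwords:
			counts[w] = counts.get(w, 0) + 1

# print terms sorted by frequency
for w, c in sorted(counts.items(), key=lambda x: -x[1]):
	print w, c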

Will this shut off researchers' access to Twitter data, as the RWW article claims? Not really, at least not everyone's access. Those researchers who build their own tools (or deploy existing ones, such as yourTwapperKeeper, on their own servers) will have no trouble at all getting all the data they want. It's just the rest — those who can't code or lack tech support (= funding) — who will be restricted to simple GUI tools. If you're a PhD student at a small university, in a department with no technical expertise or support, you have a competitive disadvantage. More power to computer scientists, and to centers like Berkman and the OII, this decision seems to say.

How to solve this problem? Luckily, services like Amazon AWS level the playing field somewhat. Setting up an account there to scrape Twitter on a regular basis (for example with yourTwapperKeeper, or with your own set of scripts) is probably the best alternative to using a service like 140kit.
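
If you go the own-scripts route, the same tweepy library from my earlier posts can be pointed at the search API instead of user timelines. A very rough sketch (untested; whether the call is api.search and which attributes the results carry depends on your tweepy version):

import tweepy

# fetch up to 100 recent tweets matching a keyword or hashtag
for result in tweepy.Cursor(tweepy.api.search, q='#yourhashtag').items(100):
	print str(result.created_at) + '\t' + result.text.encode('utf-8')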

Note: Check out this video interview with John O'Brien of TwapperKeeper, who basically gives the same advice.
