I thought I’d write a brief update to this earlier post discussing the consequences of what has recently happened with Twitter’s TOS update/enforcement of the redistribution clause. Here is a concise summary from ReadWriteWeb:
[..] Twitter’s recent announcement that it was no longer granting whitelisting requests and that it would no longer allow redistribution of content will have huge consequences on scholars’ ability to conduct their research, as they will no longer have the ability to collect or export datasets for analysis.
Read this earlier RWW post for more background. Twitter has cracked down on services like TwapperKeeper and 140kit.com that allow users not only to track Twitter keywords and hashtags, but also to export and download archives of tweets in XML or CSV. Apparently Twitter wants to stop redistribution of “its” content to the extent possible, including redistribution for research purposes. From the RWW post:
140kit offered its Twitter datasets to other scholars for their own research. By no means a full or complete scraping of Twitter data, this information that the project had collected was still made available for download (for free) to researchers. But no longer.
The people at 140kit, to their credit, are working on an approach which would allow researchers to work with Twitter data without exporting data, but rather by using their interface. From 140kit’s website:
We have a solution, which will involve using a plugin based analytical approach, which will not allow you to export data, but will, with Twitter’s blessings, allow you to ask any questions to your dataset with ease.
Hmm, sorry, but I’m underwhelmed. There are already countless services out there that allow Twitter analysis in some form, often with nebulous results, because data collection and methods are not transparent. With any list of frequent terms on Twitter, the questions need to be: what stop words did you exclude? How clean is your data? I can’t know whether these things are done appropriately for my analysis unless I do them myself. You might object that not everyone is keen on sifting through CSV files with their own scripts. That’s true outside of academic research, where a GUI tool might be fine for a casual analysis, but for serious analysis direct access to the raw data is a must. And beyond having access yourself, in the spirit of reproducible research it’s important to distribute the dataset along with your paper. That’s where we should be heading, rather than basing our analyses on pre-packaged tools and mechanisms which handle the data in ways that are opaque and beyond our control.
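To make the point concrete, here is a minimal sketch of the kind of transparent analysis I mean: counting term frequencies from a CSV export of tweets, with the stop-word list stated explicitly so it can be reported and reproduced. The column name (`text`) and the stop-word list are my own illustrative assumptions, not any particular service’s defaults.

```python
import csv
import re
from collections import Counter

# An explicitly documented stop-word list (illustrative, not canonical).
# In a real analysis, this choice should be reported alongside the results.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "rt"}

def term_frequencies(texts, stop_words=STOP_WORDS):
    """Count terms across tweet texts, excluding the given stop words."""
    counts = Counter()
    for text in texts:
        # Lowercase and tokenize; the leading [#@]? keeps hashtags and
        # mentions intact as single tokens.
        tokens = re.findall(r"[#@]?\w+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts

def frequencies_from_csv(path, column="text"):
    """Read a CSV export (one tweet per row, hypothetical 'text' column)
    and count terms across all tweets."""
    with open(path, newline="", encoding="utf-8") as f:
        return term_frequencies(row[column] for row in csv.DictReader(f))
```

Because the tokenizer and the stop-word list are right there in the script, anyone reading the paper can check exactly what was excluded and rerun the count on the distributed dataset.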
Will this shut off researchers’ access to Twitter data, as the RWW article claims? Not really, or at least not everyone’s. Those researchers who build their own tools (or deploy existing ones, such as yourTwapperKeeper, on their own servers) will have no trouble at all getting all the data they want. It’s the rest, those who can’t code or lack tech support (i.e. funding), who will be restricted to simple GUI tools. If you’re a PhD student at a small university, in a department with no technical expertise or support, you’re at a competitive disadvantage. More power to computer scientists, and to centers like Berkman and the OII, this decision seems to say.
How to solve this problem? Luckily, services like Amazon AWS level the playing field somewhat. Setting up an account there to scrape Twitter on a regular basis (for example with yourTwapperKeeper, or with your own set of scripts) is probably the best alternative to using a service like 140kit.
Note: Check out this video interview with John O’Brien of TwapperKeeper, who basically gives the same advice.
Update: I’ve written a follow-up to this post.
A few days ago, the people behind Twitter archival site TwapperKeeper.com announced that they will be discontinuing the export feature of the service on March 20, 2011. Apparently the feature is in violation of Twitter’s terms of service, at least as it’s currently implemented in TwapperKeeper.
Unfortunately this cuts off a convenient data source for a number of academics who are investigating communication on Twitter for scientific purposes. While it’s fairly easy to get data directly via the Twitter API (which is what TwapperKeeper was doing), I know many people who want to concentrate on the data itself, rather than running their own servers to scrape Twitter on a regular basis. What’s more, Twitter’s attitude is worrisome: many of us have tried to get an exemption from API rate limits in the past, to no avail. Twitter doesn’t give researchers privileged access to their data, and now they’re crippling TwapperKeeper on top of that.
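For those willing to run their own scraper, getting at the data directly is indeed not much code. Here is a rough sketch of polling the search API for a hashtag, which is essentially what TwapperKeeper does behind the scenes. The endpoint and parameters reflect the search API as it exists at the time of writing and may well change; treat this as an illustration, not a maintained tool.

```python
import json
import urllib.parse
import urllib.request

# Search API endpoint as currently documented; subject to change by Twitter.
SEARCH_URL = "http://search.twitter.com/search.json"

def build_query(query, since_id=0, rpp=100):
    """Build a search URL for one page of results newer than since_id."""
    params = {"q": query, "rpp": rpp, "result_type": "recent"}
    if since_id:
        # since_id lets each poll pick up only tweets we haven't seen yet.
        params["since_id"] = since_id
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)

def fetch_new(query, since_id=0):
    """Fetch one page of results; return (tweets, highest id seen)."""
    with urllib.request.urlopen(build_query(query, since_id)) as resp:
        data = json.load(resp)
    tweets = data.get("results", [])
    new_since = max([t["id"] for t in tweets], default=since_id)
    return tweets, new_since
```

Run `fetch_new` from a cron job (remembering the last `since_id` between runs, and appending the results to a CSV or database) and you have a rudimentary self-hosted archive, which is exactly the kind of thing one could park on an AWS instance.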
Bottom line: what will we use after March 20? Ideally, a replacement would provide the following:
- the hashtag/search query functionality of TwapperKeeper,
- the export functionality of TwapperKeeper,
- exclusive use for academic purposes (on the grounds that this might keep Twitter from shutting it down),
- stability and reliability,
- long-term viability.
The last point is important, because I don’t think it will be difficult to set up a server somewhere to suit the needs of a few people, but a larger-scale solution seems more sensible in the long run. Maybe JISC can do something like that, based on yourTwapperKeeper (which they supported)? Or one of the big institutes (OII, Berkman)? Either way it would be nice to find an alternative that doesn’t give those of us with devs and major IT support behind us a huge edge over the rest…