One week of Scientwist tweeting (January 18 – 25)

On January 27, 2010, in data, by cornelius

Here’s a list of URLs and hashtags that were popular among the @scientwists community last week. I realize that this is just a long enumeration, but I’m planning to publish these stats in a more concise format in the near future.

January 18th

January 19th

January 20th

January 21st

January 22nd

January 23rd

January 24th

January 25th


Since starting the Scientwists Project a bit over a week ago, I’ve been busy hacking up Bash and R scripts in order to analyze the data produced by the 500+ scholars that I’m following. Here’s a first glimpse of what they’ve been tweeting about, specifically the URLs and hashtags they’ve used.

In total, I’ve collected about 12,000 tweets since January 7th, containing 4,750 different URLs and 1,130 different hashtags.

10 most popular URLs

1. The Shorty Awards

2. Dennis Meadows: The Oil Drum: Economics and Limits to Growth: What’s Sustainable?

3. Björn Brembs: Social filtering of scientific information – a view beyond Twitter

4. BioData Product Blog: Laboratory Notebooks: A thing of the past?

5. Illumina’s Cheap New Gene Machine

6. A photograph of clouds that seem to resemble Great Britain :-)

7. Times Online: Baroness Greenfield loses her job in Royal Institution shake-up

8. Mr. Gunn: Cell launches a new format for the presentation of research articles online

9. Daniel Mietchen: On the need for a global academic internet platform [ref to Nadja Kutz:]

10. Rebecca Skloot: The Immortal Life of Henrietta Lacks

These were tweeted between 5 (#9 and #10) and 30 (#1) times. However, tracking URLs is complicated by the fact that many different addresses may point to the same source, especially since people use a variety of URL shorteners. This is something I’ll resolve later, so for now this is fairly anecdotal.
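One way the shortener problem could be approached is to follow each link to its final address and count by that address instead of by the string that was tweeted. The following is only a minimal Python sketch under that assumption, not the Bash/R code used for the numbers above; the function names (`normalize`, `resolve_via_http`, `count_targets`) are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def normalize(url):
    """Lower-case scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def resolve_via_http(url, timeout=10):
    """Follow redirects (e.g. from a URL shortener) and return the final address."""
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()


def count_targets(urls, resolve=resolve_via_http):
    """Count tweeted URLs by the normalized address they ultimately point to."""
    counts = Counter()
    for url in urls:
        try:
            target = resolve(url)
        except OSError:
            # Unreachable link: fall back to the address as it was tweeted.
            target = url
        counts[normalize(target)] += 1
    return counts
```

Passing the resolver in as a parameter keeps the counting logic testable without network access, for example with a small dictionary that maps short links to their targets.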

15 most popular hashtags

1. #scio10 (391x)
2. #scidebate (84x)
3. #fb (75x)
4. #science (68x)
5. #technology (67x)
6. #tcot (58x)
7. #orca (54x)
8. #debateanatel (53x)
9. #Glee (31x)
10. #ff (27x)
11. #HeLa (26x)
12. #uksnow (26x)
13. #Haiti (25x)
14. #NetDE (24x)
15. #gov20 (21x)

Obviously some of these are automatically generated (#fb and #ff), but there’s a fair share of interesting ones. I’m expecting #scio10 will dominate the next few days even more visibly.
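For readers who want to produce a similar ranking from their own collection of tweets, here is a rough Python sketch. It assumes the tweets are available as plain strings and that upper and lower case should be merged, which is a choice of this example and not necessarily how the list above was produced; `top_hashtags` is a name invented here.

```python
import re
from collections import Counter

# A '#' directly after a letter or digit (as in many URL fragments) is skipped.
HASHTAG = re.compile(r"(?<!\w)#(\w+)")


def top_hashtags(tweets, n=15):
    """Return the n most frequent hashtags as (tag, count) pairs, ignoring case."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts.most_common(n)
```

Calling `top_hashtags(tweets, 15)` on a list of tweet texts gives a ranking of the same shape as the list above.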

Hope it’s informative – let me know if you have any questions. :-)

Thoughts on the LSA data sharing resolution

On January 14, 2010, in Thoughts, by cornelius

At its recent annual meeting in Baltimore, the Linguistic Society of America (LSA) passed a resolution on data sharing that is the result of a series of discussions that took place last year, for example at the meeting of the Cyberlinguistics group in Berkeley last June.

Here’s the text (snip):

Whereas modern computing technology has the potential of advancing linguistic science by enabling linguists to work with datasets at a scale previously unimaginable; and

Whereas this will only be possible if such data are made available and standards ensuring interoperability are followed; and

Whereas data collected, curated, and annotated by linguists forms the empirical base of our field; and

Whereas working with linguistic data requires computational tools supporting analysis and collaboration in the field, including standards, analysis tools, and portals that bring together linguistic data and tools to analyze them,

Therefore, be it resolved at the annual business meeting on 8 January 2010 that the Linguistic Society of America encourages members and other working linguists to:

  • make the full data sets behind publications available, subject to all relevant ethical and legal concerns;
  • annotate data and provide metadata according to current standards and best practices;
  • seek wherever possible institutional review board human subjects approval that allows full recordings and transcripts to be made available for other research;
  • contribute to the development of computational tools which support the analysis of linguistic data;
  • work towards assigning academic credit for the creation and maintenance of linguistic databases and computational tools; and
  • when serving as reviewers, expect full data sets to be published (again subject to legal and ethical considerations) and expect claims to be tested against relevant publicly available datasets.

I think it’s great that the LSA is throwing its weight behind this effort and supporting the idea of data sharing. The only minor complaint that I have concerns the wording – what exactly does “make available” mean? It could mean real Open Access, but also that you’ll email me your datasets if I ask nicely. Or it could mean that a publisher will make your datasets available for a fee – any of these approaches qualifies as making data available in this terminology.

So, while I think this is a good starting point, more discussion is needed. Especially when it comes to formats, means of access and licensing, we need to be more explicit.

Imagine this scenario for a moment: you want to compare the semantic prosody of the verb cause across a dozen languages. If data sharing (and beyond that, resource sharing) were already a reality, we could do something like this:

1. Send a query to WordNetAPI* to identify the closest synonyms of cause in the target languages.
2. Send a query to UniversalCorpusAPI* using the terms we have just identified and specifying a list of megacorpora that we want to search in.
3. Retrieve the result in TEI-XML.
4. Analyze the results in R using the XML package.

The decisive advantage here would be that I only get the data I need, not everything else that’s in those megacorpora that is unrelated to my query. Things just need to be in XML and openly available and I can continue to process them in other ways. This would not just be sharing, but embedding your data in an infrastructure that makes it usable as part of a service. And that would be neat because what good is the data really if it doesn’t come with the tools needed to analyze it? And in 2010 tools=services, not locally installed software.

Now that would be awesome.

(*) fictional at this point, but technically quite feasible.


Microblogging services such as Twitter and FriendFeed appear to be steadily gaining popularity among academics for work-related purposes (communication at conferences, discussion of publications, casual conversation). As part of a larger project on the evolution of scholarly communication I am today launching a study of academic uses of Twitter across disciplines.

One component of this study will be a corpus of tweets by international scholars from different fields over the course of one year. This corpus will be assembled via the account @scientwists, an automated user controlled via the Twitter API, and made available in the public domain after completion. The @scientwists account will follow a list of scholars put together from several sources, starting with this list assembled by David Bradley.*

The corpus will be anonymized, i.e. user names will not be legible. It will also be possible to exclude individual posts from the corpus via use of the hashtag #exclude. However, if you receive a notification that @scientwists is following you and you would prefer for your tweets not to be included in the corpus at all, please simply block @scientwists.
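As a rough illustration of how these two rules could be applied when the corpus is built, consider the Python sketch below. It is not the actual procedure of the study: replacing names with salted hashes is just one possible way of making user names illegible, and the function names (`pseudonym`, `prepare_corpus`) and the `(user_name, text)` input format are assumptions of this example.

```python
import hashlib
import re

EXCLUDE = re.compile(r"(?<!\w)#exclude\b", re.IGNORECASE)
MENTION = re.compile(r"@(\w+)")


def pseudonym(user_name, salt):
    """Map a user name to a stable, illegible label (salted SHA-256, truncated)."""
    digest = hashlib.sha256((salt + user_name.lower()).encode("utf-8")).hexdigest()
    return "user_" + digest[:12]


def prepare_corpus(tweets, salt):
    """Drop tweets tagged #exclude and replace author names and @mentions.

    `tweets` is an iterable of (user_name, text) pairs.
    """
    corpus = []
    for user_name, text in tweets:
        if EXCLUDE.search(text):
            continue  # the author opted this post out
        text = MENTION.sub(lambda m: "@" + pseudonym(m.group(1), salt), text)
        corpus.append((pseudonym(user_name, salt), text))
    return corpus
```

Because the same name always maps to the same label, posts by one person can still be grouped in the anonymized corpus; the salt would have to be kept private for the labels to stay illegible.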

If you have questions or suggestions, please be sure to contact me on Twitter or via email.

- Cornelius Puschmann, Heinrich-Heine-Universität Düsseldorf (about me)

Note: if you are not an academic and are being followed by @scientwists2 you have been randomly included in the control group for this study. Please block @scientwists2 if you prefer your tweets not to be used.
