Edit: this post on (legal aspects of) data sharing by Creative Commons’ Kaitlin Thaney is also highly recommended.

Edit #2: This is another cross post with cyberling.org.

If you’re involved in academic publishing, whether as a researcher, librarian or publisher, data sharing and data publishing are probably hot issues for you. Beyond its versatility as a platform for the dissemination of articles and ebooks, the Internet is increasingly also a place where research data lives. Scholars are no longer restricted to referring to data in their publications or including charts and graphs alongside the text, but can link directly to data published and stored elsewhere, or even embed data into their papers, a process facilitated by standards such as the Resource Description Framework (RDF).

Journals such as Earth System Science Data and the International Journal of Robotics Research give us a glimpse at how this approach might evolve in the future: from journals to data journals, publications which are concerned with presenting valuable data for reuse and which pave the way for a research process that is increasingly collaborative. Technology is gradually catching up with the need for genuinely digital publications, a need fueled by the advantages of being able to combine text, images, links, videos and a wide variety of datasets to produce a next-generation multi-modal scholarly article. Systems such as Fedora and PubMan are meant to facilitate digital publishing and assure best-practice data provenance and storage. They are able to handle different types of data and associate any number of individual files with a “data paper” that documents them.

However, technology is the much smaller issue when weighing the advantages of data publishing against its challenges, of which there are many, both to practitioners and to those supporting them. Best practices on the individual level are cultural norms that need to be established over time. Scientists still don’t have sufficient incentives to openly share their data, as tenure processes are tied to publishing results based on data, but not to sharing data directly. And finally, technology is prone to failure when there are no agreed-upon standards guiding its use, and such standards need to be gradually (meaning painfully slowly, compared with technology’s breakneck pace) established and accepted by scholars, not decreed by committee.

In March, Jonathan Rees of NeuroCommons (a project within Creative Commons/Science Commons) published a working paper that outlines such standards for reusable scholarly data. One thing I really appreciate about Rees’ approach is that it is remarkably discipline-independent and not limited to the sciences (vs. social science and the humanities).

Rees outlines how data papers differ from traditional papers:

A data paper is a publication whose primary purpose is to expose and describe data, as opposed to analyze and draw conclusions from it. The data paper enables a division of labor in which those possessing the resources and skills can perform the experiments and observations needed to collect potentially interesting data sets, so that many parties, each with a unique background and ability to analyze the data, may make use of it as they see fit.

The key phrase here (which is why I couldn’t resist boldfacing it) is division of labor. Right now, to use an auto manufacturing analogy, a scholar does not just design a beautiful car (an analysis in the form of a research paper that culminates in observations or theoretical insights), she also has to build an engine (the data that her observations are based on). It doesn’t matter if she is a much better engineer than designer, the car will only run (she’ll only get tenure) if both the engine and the car meet the same requirements. The car analogy isn’t terribly fitting, but it serves to make the point that our current system lacks a division of labor, making it pretty inefficient. It’s based more on the idea of producing smart people than on the idea of getting smart people to produce reusable research.

Rees notes that data publishing is a complicated process and lists a set of rules for successful sharing of scientific data.

From the paper:

  1. The author must be professionally motivated to publish the data
  2. The effort and economic burden of publication must be acceptable
  3. The data must become accessible to potential users
  4. The data must remain accessible over time
  5. The data must be discoverable by potential users
  6. The user’s use of the data must be permitted
  7. The user must be able to understand what was measured and how (materials and methods)
  8. The user must be able to understand all computations that were applied and their inputs
  9. The user must be able to apply standard tools to all file formats

At a glance, these rules signify very different things. #1 and #2 are preconditions rather than prescriptions, while #3 – #6 are concerned with what the author needs to do in order to make the data available. Finally, rules #7 – #10 are concerned with making the data as useful to others as possible. Rules #7 – #10 are dependent on who “the user” is and qualify as “do-this-as-best-as-you-can”-style suggestions rather than strict requirements, not because they aren’t important, but because it’s impossible for the author to guarantee their successful implementation. By contrast, #3 – #6 are concerned with providing and preserving access and are requirements: I can’t guarantee that you’ll understand (or agree with) my electronic dictionary on Halh Mongolian, but I can make sure it’s stored in an institutional or disciplinary repository that is indexed in search engines, mirrored to assure the data can’t be lost and licensed in a legally unambiguous way, rather than uploading it to my personal website and hoping for the best when it comes to long-term availability, ease of discovery and legal re-use.

Finally, Rees gives some good advice beyond tech issues to publishers who want to implement data publishing:

Set a standard. There won’t be investment in data set reusability unless granting agencies and tenure review boards see it as a legitimate activity. A journal that shows itself credible in the role of enabling reuse will be rewarded with submissions and citations, and will in turn reward authors by helping them obtain recognition for their service to the research community.

This is critical. Don’t wait for universities, grant agencies or even scholars to agree on standards entirely on their own — they can’t and won’t if they don’t know how digital publishing works (legal aspects included). Start an innovative journal and set a standard yourself by being successful.

Encourage use of standard file formats, schemas, and ontologies. It is impossible to know what file formats will be around in ten years, much less a hundred, and this problem worries digital archivists. Open standards such as XML, RDF/XML, and PNG should be encouraged. Plain text is generally transparent but risky due to character encoding ambiguity. File formats that are obviously new or exotic, that lack readily available documentation, or that do not have non-proprietary parsers should not be accepted. Ontologies and schemas should enjoy community acceptance.

An important suggestion that is entirely compatible with linguistic data (dictionaries, word lists, corpora, transcripts, etc.) and simplified by the fact that we have comparatively small datasets. Even a megaword corpus is small compared to climate data or gene banks.
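Rees’ warning about plain text is easy to demonstrate. The following is only a minimal sketch in Python, not something taken from the paper: the sample word and the helper name `read_text` are my own illustrative choices. It shows that the same bytes yield different text depending on which encoding the reader assumes, and that decoding strictly with a declared encoding at least turns a silent corruption into a visible error.

```python
# The same bytes, read under two different encoding assumptions.
original = "na\u00efve"                 # the word "naive" with a diaeresis on the i
raw = original.encode("utf-8")          # what actually ends up in the file

as_utf8 = raw.decode("utf-8")           # correct guess: text survives intact
as_latin1 = raw.decode("latin-1")       # wrong guess: no error, but garbled text

print(repr(as_utf8))
print(repr(as_latin1))


def read_text(data: bytes, declared_encoding: str) -> str:
    """Decode bytes using the encoding declared in the dataset's metadata.

    Hypothetical helper: errors="strict" makes a mismatch raise
    UnicodeDecodeError instead of silently producing wrong characters.
    """
    return data.decode(declared_encoding, errors="strict")


# A Latin-1 file that is wrongly declared as UTF-8 fails loudly.
latin1_file = original.encode("latin-1")
try:
    read_text(latin1_file, "utf-8")
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True

print("mismatch detected:", mismatch_detected)
```

Note that the reverse mistake (UTF-8 bytes read as Latin-1) cannot be caught this way, because every byte sequence is valid Latin-1. That is one reason why a dataset in plain text should state its encoding explicitly in the accompanying data paper.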

Aggressively implement a clean separation of concerns. To encourage submissions and reduce the burden on authors and publishers, avoid the imposition of criteria not related to data reuse. These include importance (this will not be known until after others work with the data) and statistical strength (new methods and/or meta-analysis may provide it). The primary peer review criterion should be adequacy of experimental and computational methods description in the service of reuse.

This will be a tough nut to crack, because it sheds tradition to a degree. Relevance was always high on the list of requirements while publications were scarce: paper costs money, therefore what was published had to be important to as many people as possible. With data publishing this is no longer true. Whether something is important or statistically strong (applying this to linguistics one might say representative, well-documented, etc.) is impossible to know from the outset. It’s much more sensible to get it out there and deal with the analysis later, rather than creating an artificial scarcity of data. But it will take time and cultural change to get researchers (and both funding agencies and hiring committees) to adapt to this approach.

In the meantime, while we’re still publishing traditional (non-data) papers, we can at least work on making them more accessible. Something like arXiv for linguistics wouldn’t hurt.

Thoughts on the LSA data sharing resolution

On January 14, 2010, in Thoughts, by cornelius

At its recent annual meeting in Baltimore, the Linguistic Society of America (LSA) passed a resolution on data sharing that is the result of a series of discussions that took place last year, for example at the meeting of the Cyberlinguistics group in Berkeley last June.

Here’s the text (snip):

Whereas modern computing technology has the potential of advancing linguistic science by enabling linguists to work with datasets at a scale previously unimaginable; and

Whereas this will only be possible if such data are made available and standards ensuring interoperability are followed; and

Whereas data collected, curated, and annotated by linguists forms the empirical base of our field; and

Whereas working with linguistic data requires computational tools supporting analysis and collaboration in the field, including standards, analysis tools, and portals that bring together linguistic data and tools to analyze them,

Therefore, be it resolved at the annual business meeting on 8 January 2010 that the Linguistic Society of America encourages members and other working linguists to:

  • make the full data sets behind publications available, subject to all relevant ethical and legal concerns;
  • annotate data and provide metadata according to current standards and best practices;
  • seek wherever possible institutional review board human subjects approval that allows full recordings and transcripts to be made available for other research;
  • contribute to the development of computational tools which support the analysis of linguistic data;
  • work towards assigning academic credit for the creation and maintenance of linguistic databases and computational tools; and
  • when serving as reviewers, expect full data sets to be published (again subject to legal and ethical considerations) and expect claims to be tested against relevant publicly available datasets.

    I think it’s great that the LSA is throwing its weight behind this effort and supporting the idea of data sharing. The only minor complaint that I have concerns the wording – what exactly does make available mean? It could mean real Open Access, but also that you’ll email me your datasets if I ask nicely. Or it could mean that a publisher will make your datasets available for a fee – any of these approaches qualify as making data available in this terminology.

    So, while I think this is a good starting point, more discussion is needed. Especially when it comes to formats, means of access and licensing we need to be more explicit.

    Imagine this scenario for a moment: you want to compare the semantic prosody of the verb cause across a dozen languages. If data sharing (and beyond that, resource sharing) were already a reality, we could do something like this:

    1. Send a query to WordNetAPI* to identify the closest synonyms of cause in the target languages.
    2. Send a query to UniversalCorpusAPI* using the terms we have just identified and specifying a list of megacorpora that we want to search in.
    3. Retrieve the result in TEI-XML.
    4. Analyze the results in R using the XML package.

    The decisive advantage here would be that I only get the data I need, not everything else that’s in those megacorpora that is unrelated to my query. Things just need to be in XML and openly available and I can continue to process them in other ways. This would not just be sharing, but embedding your data in an infrastructure that makes it usable as part of a service. And that would be neat because what good is the data really if it doesn’t come with the tools needed to analyze it? And in 2010 tools=services, not locally installed software.

    Now that would be awesome.

    (*) fictional at this point, but technically quite feasible.
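
To make the last two steps of the scenario a little more concrete, here is a rough sketch in Python rather than R. Since the two APIs are fictional, steps 1 and 2 are replaced by a tiny hand-written response; the element layout (`s` for sentences, `w` with a `lemma` attribute for words), the sample sentences and the function name `right_collocates` are all assumptions made purely for illustration, not the output of any existing service. The sketch counts which lemma directly follows the query verb, grouped by language.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented stand-in for what a corpus service might return (step 3).
SAMPLE_RESPONSE = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <s xml:lang="en"><w lemma="storm">storm</w> <w lemma="cause">caused</w> <w lemma="damage">damage</w></s>
    <s xml:lang="en"><w lemma="delay">delays</w> <w lemma="cause">cause</w> <w lemma="problem">problems</w></s>
    <s xml:lang="de"><w lemma="Sturm">Sturm</w> <w lemma="verursachen">verursachte</w> <w lemma="Schaden">Schäden</w></s>
  </body></text>
</TEI>
"""


def right_collocates(tei_xml, node_lemmas):
    """Count the lemma immediately following each node lemma, per language (step 4)."""
    root = ET.fromstring(tei_xml)
    counts = {}
    for sentence in root.iterfind(".//tei:s", TEI_NS):
        lang = sentence.get(XML_LANG, "und")  # "und" = language not stated
        lemmas = [w.get("lemma") for w in sentence.iterfind("tei:w", TEI_NS)]
        # Walk over adjacent pairs of lemmas within the sentence.
        for current, following in zip(lemmas, lemmas[1:]):
            if current in node_lemmas:
                counts.setdefault(lang, Counter())[following] += 1
    return counts


# The verbs a (fictional) WordNet lookup might have returned in step 1.
collocates = right_collocates(SAMPLE_RESPONSE, {"cause", "verursachen"})
for lang, counter in sorted(collocates.items()):
    print(lang, counter.most_common())
```

A real analysis of semantic prosody would of course need far more data and a proper association measure. The point is only that once the results arrive in a documented XML format, the processing step is short and does not depend on any particular tool.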


    Below are my OAW09 slides for last week’s presentation, held at the University of Cologne. We were a small but enthusiastic band of Open Access supporters and I greatly enjoyed the presentations, especially the one on ArcheoInf, which is a very impressive digital humanities/open data project in archeology.
