Jan 14

At its recent annual meeting in Baltimore, the Linguistic Society of America (LSA) passed a resolution on data sharing that is the result of a series of discussions that took place last year, for example at the meeting of the Cyberlinguistics group in Berkeley last June.

Here’s the text (snip):

Whereas modern computing technology has the potential of advancing linguistic science by enabling linguists to work with datasets at a scale previously unimaginable; and

Whereas this will only be possible if such data are made available and standards ensuring interoperability are followed; and

Whereas data collected, curated, and annotated by linguists forms the empirical base of our field; and

Whereas working with linguistic data requires computational tools supporting analysis and collaboration in the field, including standards, analysis tools, and portals that bring together linguistic data and tools to analyze them,

Therefore, be it resolved at the annual business meeting on 8 January 2010 that the Linguistic Society of America encourages members and other working linguists to:

  • make the full data sets behind publications available, subject to all relevant ethical and legal concerns;
  • annotate data and provide metadata according to current standards and best practices;
  • seek wherever possible institutional review board human subjects approval that allows full recordings and transcripts to be made available for other research;
  • contribute to the development of computational tools which support the analysis of linguistic data;
  • work towards assigning academic credit for the creation and maintenance of linguistic databases and computational tools; and
  • when serving as reviewers, expect full data sets to be published (again subject to legal and ethical considerations) and expect claims to be tested against relevant publicly available datasets.

I think it’s great that the LSA is throwing its weight behind this effort and supporting the idea of data sharing. The only minor complaint that I have concerns the wording – what exactly does “make available” mean? It could mean real Open Access, but also that you’ll email me your datasets if I ask nicely. Or it could mean that a publisher will make your datasets available for a fee – any of these approaches qualifies as making data available in this terminology.

    So, while I think this is a good starting point, more discussion is needed. We need to be more explicit, especially when it comes to formats, means of access and licensing.

    Imagine this scenario for a moment: you want to compare the semantic prosody of the verb cause across a dozen languages. If data sharing (and beyond that, resource sharing) were already a reality, we could do something like this:

    1. Send a query to WordNetAPI* to identify the closest synonyms of cause in the target languages.
    2. Send a query to UniversalCorpusAPI* using the terms we have just identified and specifying a list of megacorpora that we want to search in.
    3. Retrieve the result in TEI-XML.
    4. Analyze the results in R using the XML package.
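
Steps 3 and 4 can be sketched in a few lines. Since both APIs are fictional, this is only a minimal sketch, written in Python rather than R, and it assumes that the service returns matching sentences as TEI `<s>` elements tagged with `xml:lang`. The sample response, the `NODE_FORMS` table and the `collocates` helper are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical response from the (fictional) corpus service.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <s xml:lang="en">The storm may cause serious damage to crops.</s>
      <s xml:lang="en">Smoking can cause severe illness.</s>
      <s xml:lang="de">Der Sturm kann schwere Schäden verursachen.</s>
    </body>
  </text>
</TEI>"""

# Hypothetical output of step 1: the search terms per language.
NODE_FORMS = {"en": {"cause"}, "de": {"verursachen"}}


def collocates(tei_xml, node_forms, window=2):
    """Count the words within `window` tokens of a node word, per language."""
    root = ET.fromstring(tei_xml)
    counts = {}
    for s in root.iterfind(".//tei:s", TEI_NS):
        lang = s.get(XML_LANG, "und")
        # Naive whitespace tokenization; a real study would use a tokenizer.
        tokens = [t.strip(".,;:!?").lower() for t in (s.text or "").split()]
        for i, tok in enumerate(tokens):
            if tok in node_forms.get(lang, ()):
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + window + 1]
                counts.setdefault(lang, Counter()).update(left + right)
    return counts


result = collocates(SAMPLE_TEI, NODE_FORMS)
for lang, counter in sorted(result.items()):
    print(lang, sorted(counter.items()))
```

The collocate counts per language are the raw material for a semantic prosody comparison; judging whether those collocates are negative or positive would be a separate step that is not shown here.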

    The decisive advantage here would be that I only get the data I need, not everything else that’s in those megacorpora that is unrelated to my query. Things just need to be in XML and openly available and I can continue to process them in other ways. This would not just be sharing, but embedding your data in an infrastructure that makes it usable as part of a service. And that would be neat because what good is the data really if it doesn’t come with the tools needed to analyze it? And in 2010 tools=services, not locally installed software.

    Now that would be awesome.

    (*) fictional at this point, but technically quite feasible.

    Aug 19

    (Edit 9/7/2009: lexicographer, digital humanist and webdev wizard Toma Tasovac has taken the time to translate this post into Serbian. Thank you, Toma!)

    In the course of the last half year or so, I’ve had the chance to get a much broader impression of the research being done by colleagues from other disciplines in Internet Studies/Internet Research. I consider myself an online researcher head to toe, but my background in linguistics means that I approach my object of study from a slightly different direction than a sociologist, a social psychologist or a mass communications scholar would. These differences are minor and much of the time they fade into the background, but there are situations when they do become visible. I want to outline how I think these nuances of difference can benefit and enrich the study of online communication, because they allow us – scholars with various backgrounds, or, more precisely, different disciplinary origins – to learn from one another. Specifically, I want to make the argument that analyzing how people communicate online can provide us not only with insights about the social and cultural dynamics of the Internet, but also with valuable data on how online communication is conceptualized, in other words, on how we think about the Internet and what we are doing when we use it to express ourselves. An aspect I want to address in passing with this argument is an artificial dichotomy that I feel has proven itself to be counterproductive: the split between the cognitive and the social and cultural dimensions of communication.

    A sociologist is likely to be particularly interested in the Internet’s potential for social interaction and in where, how and why this potential is realized. Of course this is also likely to be relevant (for example) to the (socio)linguist, but inevitably her focus is on language first and on community second, the latter seen as a key factor shaping the former. The implicit argument that many linguists follow is that language is shaped by cognition as well as social convention, and while it would be futile to untangle the two from one another, it is possible to point out their individual influences.

    Internet communication happens through a variety of channels. It can be spoken or take place via video, but a significant percentage is typed via keyboards and touchscreens. A linguistically-oriented approach to computer-mediated communication is akin to an archeology of Internet Studies in that it starts with the smallest units of typed communication and works its way up incrementally: from words to sentences to documents, to pieces of discourse to genres of communication and beyond that to their form and function. Obviously doing this is not an activity restricted to linguistics – researchers from countless disciplines do it and bring their own methods and approaches to the table. In some instances the narrow focus of linguistics on language in CMC research can be a limitation when it fails to contextualize observations about a genre with the social context that shapes said genre and the role it plays for its community of users. But it is worth pointing out which aspects of language use online are shaped by universal communicative principles, and not the conventions of individual communities or users, not because this lessens the importance of said conventions in any way, but because it allows us to understand online communication in its entirety better.

    The question at the heart of the cognitive dimension of Internet Studies is: When we communicate on the Internet, what exactly is it we think we are doing and where and with whom do we think we are doing it?

    The question may seem strange at first – one could argue that we are having conversations or chatting on Twitter, that we are writing a diary or publishing an opinion piece or rant in our blog. But the words we use to describe these activities reveal our association of new concepts with familiar ones. The blueprint of a conversation is a face-to-face exchange via sound waves between people who are in proximity to one another, so that these sound waves can travel from one participant to the other and trigger inferential processes in their heads. Publishing traditionally describes the production and dissemination of printed documents, such as books, magazines etc. In other words, almost all of the terms we use to describe what we do online are metaphorical extensions of pre-existing concepts (surf, chat, browse, search). Interestingly, those forms of online communication that go beyond pre-digital metaphors have given rise to their own vocabulary (e.g. blog, tweet) and as natively digital communication evolves (e.g. Google Wave) we take up more and more practices that cannot be described in terms of what is already familiar from pre-digital contexts.

    In other words, a wide variety of seemingly mundane practices of online communication are shaped by complex and increasingly unstable metaphors. Many of these metaphors are dependent on cultural convention, but some are also cognitively salient and universal. For example, people talk about websites as if they were places (using words like “here”, “there”, “on that site”) not because they are taught to do so, but because space and spatial orientation apparently lend themselves well to thinking about the Internet (as well as many other abstract concepts).

    The Internet gives us access to language data on an unprecedented scale, but it would be a shame if all we did with it was to study words and sentences in artificial isolation. We would be ignoring the process in its entirety and missing the larger picture, a picture that only multidisciplinary teams with a variety of methods can accurately draw. At the same time, there are dimensions of Internet Studies where the focus on social aspects alone misses important things. Why do so many people blog and tweet in relative isolation, reporting thoughts and states that others do not respond to and are perhaps not meant to respond to? Can the creation of (social) media be accounted for by notions such as social capital alone, or is there an inherent psychological salience of digital media as a mirror, a permanent diachronic record of the self? What about the non-social dimension of social media?

    Let me know what you think.
