After recently teaching an introductory class to R aimed at linguists at the University of Bayreuth, I’ve decided to put my extended notes on the website in the form of a very basic tutorial. Check out Corpus and Text Linguistics with R (CTL-R) if you want to learn R fundamentals and have no prior programming experience. It’s still incomplete at present, but I hope to have more chapters ready soon. Happy R-ing! :-)

Unfortunately I’m not able to attend the annual IPrA conference next week in Manchester and had to cancel the trip at short notice. I was scheduled to give a talk as part of the session Quoting in Computer-mediated Communication on my work with Katrin Weller on retweeting among scientists.

Luckily for me, there will be a follow-up event of sorts (see below). I’ve posted the call here since it doesn’t seem to be available on the Web other than as a PDF. Submit something if you’re doing research on quoting! I’m fairly sure that the deadline will be extended by a week or two.

CfP: Quoting Now and Then – 3rd International Conference on Quotation and Meaning (ICQM)

University of Augsburg, Germany

19 April – 21 April 2012

Conference Convenors:
Wolfram Bublitz
Jenny Arendholz
Christian Hoffmann
Monika Kirner

Contact: Monika Kirner

Call for Papers
This conference addresses the pragmatics of quoting as a metacommunicative act both in old (printed) and new (electronically mediated) communication. With the rapid evolution of new media in the last two decades, approaches to the study of (forms, functions and impact of) quoting have been gaining momentum in linguistics. Although quotations in print media have already been investigated to some extent, quoting in computer-mediated communication is still uncharted territory. This conference shall focus on the formal and functional evolution of quoting from old (analog) to new (digital) media. While the conference builds on the panel “Quoting in Computer-mediated Communication” to be presented in July 2011 at the International Conference of Pragmatics (IPrA), it assumes a much broader perspective, paying special tribute to the inherent confluence and complementarity of synchronic and diachronic approaches. Consequently, we invite papers from both (synchronic and diachronic) perspectives to report on the formal, functional as well as the pragmatic-discursive and multimodal nature of quoting in different genres or media.

Plenary talk: Jörg Meibauer

Please submit an abstract of not more than 500 words (for a 30 min talk plus 10 min discussion) via e-mail to

Deadline for abstracts:
15 August 2011 (extended from 1 July 2011)

Note: The title of this post is somewhat misleading — the paper in question appears to be the most widely cited paper in Language, not necessarily in linguistics.

Mark Dingemanse has posted an interesting analysis of the LSA’s recent survey for their anthology of Language. From the survey:

For each volume of the Anthology, we are seeking input on those articles which represent the best scholarship published during that particular period. By “best,” we mean the most influential, the most cited, the most visited in JSTOR, and those considered a must-read for students and scholars of the discipline.

Mark has put together a spreadsheet showing the ranking of six popular Language articles in terms of how often they are viewed on JSTOR and added citation information from Google Scholar to the ranking. The results are interesting for several reasons and I wanted to briefly remark on them.

Note: have a look at Mark’s spreadsheet for more detail than is visible in the chart above.

Mark points out that the 1974 paper A Simplest Systematics for the Organization of Turn-Taking for Conversation by Sacks et al. has a remarkable lead when it comes to the number of citations. The first thing to consider, in my view, is that Google Scholar is not entirely accurate (see this; there are other, more recent studies showing attribution problems persist). Google Scholar’s greatest sin in the eyes of most librarians is that it largely ignores metadata and instead resorts to text-mining approaches to determine things such as author and publication name. I use GS regularly and it’s a fantastic resource, but its citation counts should be taken with a grain of salt, to put it mildly.

My other argument is something I cannot fully back up, but that seems very plausible to me: Sacks et al. is much more accessible than other highly ranked papers.

Accessible in what sense?

  1. The topic of the paper makes it relevant to scholars in other disciplines. The clear, non-technical and theory-agnostic title adds to this. People find papers via search and you can’t search for terms you aren’t familiar with.
  2. There are multiple open access PDF copies of the paper available that one can download without access to JSTOR (here, here and here — note that two copies are stored on the websites of a language and social interaction program and a computer science department).
  3. The paper is cited in the Wikipedia article on conversation analysis (in fact, this is the highest-ranked Google hit when searching for the exact name of the article, even before the JSTOR page).

If you compare this with the other top-ranked papers you’ll come to the conclusion that

  1. their subject and scope and how it is reflected in the exact wording of the title makes them less relevant to other disciplines,
  2. they aren’t accessible except through JSTOR,
  3. they aren’t referenced in Wikipedia (because they aren’t accessible).

Of course my argumentation is somewhat skewed if we assume that both the citation figures and the numbers from JSTOR might not be entirely accurate. The #2 paper in JSTOR (Curtiss et al.) is likely to have a large number of views because it pops up as the #1 search result when searching for Genie on JSTOR, because the term is fairly ambiguous, and because it is related to a spectacular and tragic incident.

Do linguists (and scholars in general) take the second and third argument into account? My impression is they don’t, at least not enough. Even the Language Anthology will not be openly accessible, although many popular texts are de facto available, whether legally or not (e.g. Chomsky’s review of Skinner). We should aim to make more of our research, past and present, both technically and legally available on the Internet. This will benefit colleagues from other fields, the general public, and ultimately linguistics as a discipline.

Thoughts on the LSA data sharing resolution

On January 14, 2010, in Thoughts, by cornelius

At its recent annual meeting in Baltimore, the Linguistic Society of America (LSA) passed a resolution on data sharing that is the result of a series of discussions that took place last year, for example at the meeting of the Cyberlinguistics group in Berkeley last June.

Here’s the text (snip):

Whereas modern computing technology has the potential of advancing linguistic science by enabling linguists to work with datasets at a scale previously unimaginable; and

Whereas this will only be possible if such data are made available and standards ensuring interoperability are followed; and

Whereas data collected, curated, and annotated by linguists forms the empirical base of our field; and

Whereas working with linguistic data requires computational tools supporting analysis and collaboration in the field, including standards, analysis tools, and portals that bring together linguistic data and tools to analyze them,

Therefore, be it resolved at the annual business meeting on 8 January 2010 that the Linguistic Society of America encourages members and other working linguists to:

  • make the full data sets behind publications available, subject to all relevant ethical and legal concerns;
  • annotate data and provide metadata according to current standards and best practices;
  • seek wherever possible institutional review board human subjects approval that allows full recordings and transcripts to be made available for other research;
  • contribute to the development of computational tools which support the analysis of linguistic data;
  • work towards assigning academic credit for the creation and maintenance of linguistic databases and computational tools; and
  • when serving as reviewers, expect full data sets to be published (again subject to legal and ethical considerations) and expect claims to be tested against relevant publicly available datasets.

I think it’s great that the LSA is throwing its weight behind this effort and supporting the idea of data sharing. The only minor complaint that I have concerns the wording – what exactly does make available mean? It could mean real Open Access, but also that you’ll email me your datasets if I ask nicely. Or it could mean that a publisher will make your datasets available for a fee – any of these approaches qualify as making data available in this terminology.

    So, while I think this is a good starting point, more discussion is needed. Especially when it comes to formats, means of access and licensing we need to be more explicit.

    Imagine this scenario for a moment: you want to compare the semantic prosody of the verb cause across a dozen languages. If data sharing (and beyond that, resource sharing) were already a reality, we could do something like this:

    1. Send a query to WordNetAPI* to identify the closest synonyms of cause in the target languages.
    2. Send a query to UniversalCorpusAPI* using the terms we have just identified and specifying a list of megacorpora that we want to search in.
    3. Retrieve the result in TEI-XML.
    4. Analyze the results in R using the XML package.
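
The last two steps in the list above can be roughed out in a few lines. This is only a sketch under stated assumptions: it is written in Python rather than R, it does not call any service (both APIs are fictional), and the response layout is my own guess, namely TEI-style `s` (sentence) and `w` (word) elements with an `xml:lang` attribute. The names `SAMPLE_TEI` and `right_collocates` are hypothetical. Counting the words that follow a node word is just one simple first step towards looking at semantic prosody.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# TEI namespace, plus the qualified name ElementTree uses for xml:lang.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical response from the (fictional) corpus service.
# The element layout is an assumption made for this sketch.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <s xml:lang="en"><w>smoking</w> <w>can</w> <w>cause</w> <w>serious</w> <w>damage</w></s>
      <s xml:lang="en"><w>delays</w> <w>cause</w> <w>problems</w></s>
      <s xml:lang="de"><w>Stress</w> <w>kann</w> <w>Probleme</w> <w>verursachen</w></s>
    </body>
  </text>
</TEI>"""


def right_collocates(tei_xml, node, lang, window=2):
    """Count the words found up to `window` tokens to the right of `node`
    in all sentences marked with the given language code."""
    root = ET.fromstring(tei_xml)
    counts = Counter()
    for sentence in root.iterfind(".//tei:s", TEI_NS):
        if sentence.get(XML_LANG) != lang:
            continue
        tokens = [w.text.lower() for w in sentence.iterfind("tei:w", TEI_NS)]
        for i, token in enumerate(tokens):
            if token == node:
                counts.update(tokens[i + 1 : i + 1 + window])
    return counts


if __name__ == "__main__":
    print(right_collocates(SAMPLE_TEI, "cause", "en"))
```

For verb-final languages such as German, one would look to the left of the node word instead, so the window direction would have to become a parameter.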

    The decisive advantage here would be that I only get the data I need, not everything else that’s in those megacorpora that is unrelated to my query. Things just need to be in XML and openly available and I can continue to process them in other ways. This would not just be sharing, but embedding your data in an infrastructure that makes it usable as part of a service. And that would be neat because what good is the data really if it doesn’t come with the tools needed to analyze it? And in 2010 tools=services, not locally installed software.

    Now that would be awesome.

    (*) fictional at this point, but technically quite feasible.

    (Edit 9/7/2009: lexicographer, digital humanist and webdev wizard Toma Tasovac has taken the time to translate this post into Serbian. Thank you, Toma!)

    In the course of the last half year or so, I’ve had the chance to get a much broader impression of the research being done by colleagues from other disciplines in Internet Studies/Internet Research. I consider myself an online researcher head to toe, but my background in linguistics means that I approach my object of study from a slightly different direction than a sociologist, a social psychologist or a mass communications scholar would. These differences are minor and much of the time they fade into the background, but there are situations when they do become visible. I want to outline how I think these nuances of difference can benefit and enrich the study of online communication, because they allow us – scholars with various backgrounds, or, more precisely, different disciplinary origins – to learn from one another. Specifically, I want to make the argument that analyzing how people communicate online can provide us not only with insights about the social and cultural dynamics of the Internet, but also with valuable data on how online communication is conceptualized, in other words, on how we think about the Internet and what we are doing when we use it to express ourselves. An aspect I want to address in passing with this argument is an artificial dichotomy that I feel has proven itself to be counterproductive: the split between the cognitive and the social and cultural dimensions of communication.

    A sociologist is likely to be particularly interested in the Internet’s potential for social interaction and in where, how and why this potential is realized. Of course this is also likely to be relevant (for example) to the (socio)linguist, but inevitably her focus is on language first and on community second, the latter seen as a key factor shaping the former. The implicit argument that many linguists follow is that language is shaped by cognition as well as social convention, and while it would be futile to untangle the two from one another, it is possible to point out their individual influences.

    Internet communication happens through a variety of channels. It can be spoken or take place via video, but a significant percentage is typed via keyboards and touchscreens. A linguistically-oriented approach to computer-mediated communication is akin to an archeology of Internet Studies in that it starts with the smallest units of typed communication and works its way up incrementally: from words to sentences to documents, to pieces of discourse to genres of communication and beyond that to their form and function. Obviously doing this is not an activity restricted to linguistics – researchers from countless disciplines do it and bring their own methods and approaches to the table. In some instances the narrow focus of linguistics on language in CMC research can be a limitation when it fails to contextualize observations about a genre with the social context that shapes said genre and the role it plays for its community of users. But it is worth pointing out which aspects of language use online are shaped by universal communicative principles, and not the conventions of individual communities or users, not because this lessens the importance of said conventions in any way, but because it allows us to understand online communication in its entirety better.

    The question at the heart of the cognitive dimension of Internet Studies is: When we communicate on the Internet, what exactly is it we think we are doing and where and with whom do we think we are doing it?

    The question may seem strange at first – one could argue that we are having conversations or chatting on Twitter, that we are writing a diary or publishing an opinion piece or rant in our blog. But the words we use to describe these activities reveal our association of new concepts with familiar ones. The blueprint of a conversation is a face-to-face exchange via sound waves between people who are in proximity to one another, so that these sound waves can travel from one participant to the other and trigger inferential processes in their heads. Publishing traditionally describes the production and dissemination of printed documents, such as books, magazines etc. In other words, almost all of the terms we use to describe what we do online are metaphorical extensions of pre-existing concepts (surf, chat, browse, search). Interestingly, those forms of online communication that go beyond pre-digital metaphors have given rise to their own vocabulary (e.g. blog, tweet) and as natively digital communication evolves (e.g. Google Wave) we take up more and more practices that cannot be described in terms of what is already familiar from pre-digital contexts.

    In other words, a wide variety of seemingly mundane practices of online communication are shaped by complex and increasingly unstable metaphors. Many of these metaphors are dependent on cultural convention, but some are also cognitively salient and universal. For example, people talk about websites as if they were places (using words like “here”, “there”, “on that site”) not because they are taught to do so, but because space and spatial orientation apparently lend themselves well to thinking about the Internet (as well as many other abstract concepts).

    The Internet gives us access to language data on an unprecedented scale, but it would be a shame if all we did with it was to study words and sentences in artificial isolation. We would be ignoring the process in its entirety and missing the larger picture, a picture that only multidisciplinary teams with a variety of methods can accurately draw. At the same time, there are dimensions of Internet Studies where the focus on social aspects alone misses important things. Why do so many people blog and tweet in relative isolation, reporting thoughts and states that others do not respond to and are perhaps not meant to respond to? Can the creation of (social) media be accounted for by notions such as social capital alone, or is there an inherent psychological salience of digital media as a mirror, a permanent diachronic record of the self? What about the non-social dimension of social media?

    Let me know what you think.