After recently teaching an introductory class to R aimed at linguists at the University of Bayreuth, I’ve decided to put my extended notes on the website in the form of a very basic tutorial. Check out Corpus and Text Linguistics with R (CTL-R) if you want to learn R fundamentals and have no prior programming experience. It’s still incomplete at present, but I hope to have more chapters ready soon. Happy R-ing!
Unfortunately I’m not able to attend the annual IPrA conference next week in Manchester and had to cancel the trip on short notice. I was scheduled to give a talk as part of the session Quoting in Computer-mediated Communication on my work with Katrin Weller on retweeting among scientists.
Luckily for me, there will be a follow-up event of sorts (see below). I’ve posted the call here since it doesn’t seem to be available on the Web other than as a PDF. Submit something if you’re doing research on quoting! I’m fairly sure that the deadline will be extended by a week or two.
CfP: Quoting Now and Then – 3rd International Conference on Quotation and Meaning (ICQM)
University of Augsburg, Germany
19 April – 21 April 2012
Contact: Monika Kirner
Call for Papers
This conference addresses the pragmatics of quoting as a metacommunicative act both in old (printed) and new (electronically mediated) communication. With the rapid evolution of new media in the last two decades, approaches to the study of (forms, functions and impact of) quoting have been gaining momentum in linguistics. Although quotations in print media have already been investigated to some extent, quoting in computer-mediated communication is still uncharted territory. This conference shall focus on the formal and functional evolution of quoting from old (analog) to new (digital) media. While the conference builds on the panel “Quoting in Computer-mediated Communication” to be presented in July 2011 at the International Conference of Pragmatics (IPrA), it assumes a much broader perspective, paying special tribute to the inherent confluence and complementarity of synchronic and diachronic approaches. Consequently, we invite papers from both (synchronic and diachronic) perspectives to report on the formal, functional as well as the pragmatic-discursive and multimodal nature of quoting in different genres or media.
Plenary talk: Jörg Meibauer
Please submit an abstract of not more than 500 words (for a 30 min talk plus 10 min discussion) via e-mail to email@example.com
Deadline for abstracts: 15 August 2011 (extended from 1 July 2011)
Note: The title of this post is somewhat misleading — the paper in question appears to be the most widely cited paper in Language, not necessarily in linguistics.
For each volume of the Anthology, we are seeking input on those articles which represent the best scholarship published during that particular period. By “best,” we mean the most influential, the most cited, the most visited in JSTOR, and those considered a must-read for students and scholars of the discipline.
Mark has put together a spreadsheet ranking six popular Language articles by how often they are viewed on JSTOR, and has added citation figures from Google Scholar to the ranking. The results are interesting for several reasons, and I wanted to briefly remark on them.
Note: have a look at Mark’s spreadsheet for more detail than is visible in the chart above.
Mark points out that the 1974 paper A Simplest Systematics for the Organization of Turn-Taking for Conversation by Sacks et al. has a remarkable lead when it comes to the number of citations. The first thing to consider, in my view, is that Google Scholar is not entirely accurate (see this; there are other, more recent studies showing that attribution problems persist). Google Scholar’s greatest sin in the eyes of most librarians is that it largely ignores metadata and instead resorts to text-mining approaches to determine things such as author and publication name. I use GS regularly and it’s a fantastic resource, but its citation counts should be taken with a grain of salt, to put it mildly.
My other argument is something I cannot fully back up, but it seems very plausible to me: Sacks et al. is much more accessible than other highly ranked papers.
Accessible in what sense?
- The topic of the paper makes it relevant to scholars in other disciplines. The clear, non-technical and theory-agnostic title adds to this. People find papers via search and you can’t search for terms you aren’t familiar with.
- There are multiple open access PDF copies of the paper available that one can download without access to JSTOR (here, here and here — note that two copies are stored on the websites of a language and social interaction program and a computer science department).
- The paper is cited in the Wikipedia article on conversation analysis (in fact, this is the highest-ranked Google hit when searching for the exact name of the article, even before the JSTOR page).
If you compare this with the other top-ranked papers you’ll come to the conclusion that
- their subject and scope, and how these are reflected in the exact wording of the title, make them less relevant to other disciplines,
- they aren’t accessible except through JSTOR,
- they aren’t referenced in Wikipedia (because they aren’t accessible).
Of course my argument is somewhat skewed if we assume that both the citation figures and the numbers from JSTOR might not be entirely accurate. The #2 paper on JSTOR (Curtiss et al.) is likely to have a large number of views because it pops up as the #1 search result when searching for Genie on JSTOR, because that search term is fairly ambiguous, and because the paper is related to a spectacular and tragic case.
Do linguists (and scholars in general) take the second and third argument into account? My impression is they don’t, at least not enough. Even the Language Anthology will not be openly accessible, although many popular texts are de facto available, whether legally or not (e.g. Chomsky’s review of Skinner). We should aim to make more of our research — past and present — both technically and legally available on the Internet. This will benefit colleagues from other fields, the general public, and ultimately linguistics as a discipline.
At its recent annual meeting in Baltimore, the Linguistic Society of America (LSA) passed a resolution on data sharing that is the result of a series of discussions that took place last year, for example at the meeting of the Cyberlinguistics group in Berkeley last June.
Here’s the text (snip):
Whereas modern computing technology has the potential of advancing linguistic science by enabling linguists to work with datasets at a scale previously unimaginable; and
Whereas this will only be possible if such data are made available and standards ensuring interoperability are followed; and
Whereas data collected, curated, and annotated by linguists forms the empirical base of our field; and
Whereas working with linguistic data requires computational tools supporting analysis and collaboration in the field, including standards, analysis tools, and portals that bring together linguistic data and tools to analyze them,
Therefore, be it resolved at the annual business meeting on 8 January 2010 that the Linguistic Society of America encourages members and other working linguists to:
- make the full data sets behind publications available, subject to all relevant ethical and legal concerns;
- annotate data and provide metadata according to current standards and best practices;
- seek wherever possible institutional review board human subjects approval that allows full recordings and transcripts to be made available for other research;
- contribute to the development of computational tools which support the analysis of linguistic data;
- work towards assigning academic credit for the creation and maintenance of linguistic databases and computational tools; and
- when serving as reviewers, expect full data sets to be published (again subject to legal and ethical considerations) and expect claims to be tested against relevant publicly available datasets.
I think it’s great that the LSA is throwing its weight behind this effort and supporting the idea of data sharing. The only minor complaint I have concerns the wording – what exactly does make available mean? It could mean real Open Access, but also that you’ll email me your datasets if I ask nicely. Or it could mean that a publisher will make your datasets available for a fee – any of these approaches would qualify as making data available under this wording.
So, while I think this is a good starting point, more discussion is needed. Especially when it comes to formats, means of access, and licensing, we need to be more explicit.
Imagine this scenario for a moment: you want to compare the semantic prosody of the verb cause across a dozen languages. If data sharing (and beyond that, resource sharing) were already a reality, we could do something like this:
1. Send a query to WordNetAPI* to identify the closest synonyms of cause in the target languages.
2. Send a query to UniversalCorpusAPI* using the terms we have just identified, specifying a list of megacorpora that we want to search in.
3. Retrieve the result in TEI-XML.
4. Analyze the results in R using the XML package.
The decisive advantage here would be that I only get the data I need, not everything else in those megacorpora that is unrelated to my query. As long as things are in XML and openly available, I can continue to process them in other ways. This would not just be sharing, but embedding your data in an infrastructure that makes it usable as part of a service. And that would be neat, because what good is data, really, if it doesn’t come with the tools needed to analyze it? And in 2010, tools = services, not locally installed software.
Now that would be awesome.
(*) fictional at this point, but technically quite feasible.
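To make the scenario a bit more tangible, here is a rough R sketch of steps 3 and 4. Since WordNetAPI and UniversalCorpusAPI are fictional, the HTTP requests are only sketched in comments (with invented URLs), and the TEI-XML response is a made-up miniature example; only the parsing with the XML package reflects real, working code.

```r
# Steps 1-2 are fictional, so the requests are only sketched:
#   synonyms <- getURL("http://wordnetapi.example/synonyms?lemma=cause&langs=de,fr")
#   tei      <- getURL("http://universalcorpusapi.example/query?terms=...&corpora=...")

library(XML)  # CRAN package for parsing XML in R

# A made-up miniature TEI-XML response standing in for step 3:
tei <- '<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <s lang="en">Smoking can cause cancer.</s>
    <s lang="de">Rauchen kann Krebs verursachen.</s>
    <s lang="en">Stress may cause headaches.</s>
  </body></text>
</TEI>'

# Step 4: parse the response and pull out the sentences and their languages.
doc <- xmlParse(tei)
ns  <- c(tei = "http://www.tei-c.org/ns/1.0")
hits  <- xpathSApply(doc, "//tei:s", xmlValue, namespaces = ns)
langs <- xpathSApply(doc, "//tei:s", xmlGetAttr, "lang", namespaces = ns)
table(langs)  # hit counts per language, ready for further analysis in R
```

From here, the sentences could feed straight into whatever collocation or semantic-prosody analysis one has in mind — which is exactly the point: the service returns only the relevant slice of the megacorpora, already in a machine-readable format.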