This is my second report from the Berlin 9 Open Access Conference, this one summarizing Tuesday’s session on Open Access Policy. I’m still catching up on yesterday’s talks and will post those later today or early tomorrow.

The session was moderated by Alma Swan of Enabling Open Scholarship, also director of Key Perspectives Ltd. Alma introduced the panelists:

  • Bernard Rentier, Rector, Université de Liège
  • Stuart Shieber, Director, Office for Scholarly Communication, Harvard University
  • William Nixon, Digital Library Development Manager, University of Glasgow
  • Jeffrey Vitter, Provost and Executive Vice Chancellor, University of Kansas

After this, Alma laid out some of the key issues on which the presenters would focus in their talks, namely the precise wording of the institutional open access policy that they have put into place, the people involved in planning and implementing it, the nature of the implementation and finally the resources for ongoing support (as she pointed out, if there is no ongoing support, open access does not work). Alma then proposed an elaborate typology of policies based on multiple factors, i.e. who retains rights, whether or not there is a waiver, when deposit takes place and whether or not there is an embargo on the full text or the article meta-data. I’m hoping to be able to include Alma’s slides here later; there is a very nice table in them that describes these points.

Bernard Rentier from the Université de Liège in Belgium was the first presenter and gave a very engaging talk. He started with the analogy that a university that doesn’t know what it is publishing is like a factory that doesn’t know what it’s producing. The initial motivation at Liège was to create an inventory of what was being published there. Scholars wanted to be able to extract lists of their publications easily and be more visible to search engines. Bernard went on to describe what he called the Liège approach of carrot and stick and summarized this by saying “If you don’t have a mandate, nothing happens. If you have a mandate and don’t enforce it, nothing happens.” Having a mandate to deposit articles, the enforcement of this mandate, the quality of service provided and the incentives and sanctions in place are all vital. Bernard then described ORBi, the university’s repository. ORBi has 68,000 records and 41,000 full texts (50%), all uploaded by the researchers themselves. Most of the papers which are not available in full text were published before 2002. Papers which have a record in the repository are cited twice as often as papers by Liège authors that do not have a record, something that Bernard attributed to their strongly improved findability. Not all full texts in ORBi are Open Access — roughly half of the texts are embargoed, waiting to be made available after the embargo has been lifted. Bernard explained that 20% of what is published in ORBi constitutes what is often called grey literature (reports, unpublished manuscripts), which is now much more visible than before. He noted that ORBi had been marketed as being “not just another tool for librarians”, but rather that his goal had been to involve the entire faculty, something that was also furthered by making the report produced by ORBi the sole document relevant in all performance evaluations (e.g. for promotions, tenure). ORBi is linked to Liège’s digital university phone book, tying it to general identity information that people might search for. It is also promoted aggressively on the university website rather than being hidden away on the pages of the library. Bernard closed by saying that today ORBi attracts an impressive 1,100 article downloads per day and that plans were underway to use the system at the University of Luxembourg, the Czech Academy of Sciences and other institutions.

Stuart Shieber followed with a talk on the development of the Harvard Open Access mandate, introduced in 2008. Since its original introduction, a total of eight Harvard schools have joined the agreement, which generally mandates use of the institutional repository for publications (though there is a waiver). Stuart described how the first preparations began in 2006 and how there was much discussion in the academic senate. The FAS faculty voted in February 2008 and unanimously accepted the new policy. Stuart outlined its structure as follows:

  • permission (1): author grants the university rights
  • waiver (2): if you want a waiver, you get a waiver
  • deposit (3): deposit is mandated upon publication, and everything is deposited, including material under embargo

This creates a structure where authors retain a maximum of control over their publications, yet generally deposit what they publish in the university’s repository. Stuart closed by saying (in reference to Bernard) “We’re not trying to apply a stick, we’re trying to apply a carrot” (e.g. statistics for authors on their article use and other incentives).

Next up was William Nixon of the University of Glasgow, who presented their repository, Enlighten. William started by saying that he wasn’t wild about the terms “mandate” and “repository”, but that they had sought to communicate the usefulness of Enlighten to authors, winning them over rather than forcing them to use the service. He described the wide integration of Enlighten with other services and cited a statistic showing that 80% of traffic to the repository comes from Google. William then gave a historical account of their approach. After launching Enlighten in 2006 and “strongly encouraging” its use by authors, virtually nothing happened. In 2007 a student thesis mandate was introduced, making it a requirement for all theses to be deposited. In 2008, all publications “where copyright permits” by the faculty were included. In 2010, the report generated by Enlighten was made a key element of the overall research assessment, an important step mirroring the strategy used in Liège. William also discussed staff concerns: What content must be provided? Am I breaking copyright law by using the repository? How and by whom will the publication be seen and accessed online? What version (repository vs. publisher) of my publication will be cited? William closed by giving a brief account of the repository’s performance record: 14,000 new records had been added in 2010 alone, a rapid rate of growth.

The University of Kansas’ provost and executive vice chancellor, Jeffrey Vitter, gave a historical account of how KU ScholarWorks, the university’s repository, had been gradually developed and introduced, and pointed to the importance of the advocacy of organizations such as ARL, which had promoted the idea behind institutional repositories (IRs) and Open Access for many years, making it easier to popularize the idea among the faculty. I apologize for not having an in-depth account of Jeffrey’s talk, but at this point jet lag caught up with me. If you have any notes to contribute for this or any other part of the session, please share.

In the Q&A that followed the presentations, what stuck with me was Bernard Rentier’s response to a question from an Elsevier representative about whether collaboration with publishers was not paramount for the success of an open access policy. Bernard emphatically described the difficulties he had experienced in the past when negotiating with major publishers and made clear that, while he was open to collaboration, a sign of trust would be in order first.

This is my first post reporting from the Berlin 9 Open Access Conference taking place in Bethesda this week. I’ll be reporting and summarizing as thoroughly as I can, starting with two pre-conference sessions that took place yesterday.

Note: I’ll include the presenters’ slides here if I can somehow get my hands on them. Stay tuned.

Christoph Bruch of the Max Planck Digital Library (MPDL) opened the first pre-conference session on Open Access Publishing by introducing the four presenters:

  • Neil Thakur, NIH (perspective of funders and government)
  • Peter Binfield, PLoS ONE (perspective of an OA publisher)
  • Pierre Mounier, Cléo/OpenEdition.org (alternate approach to gold/green OA)
  • Caroline Sutton, OASPA (Open Access Scholarly Publishers Association) & Co-Action Publishing (perspective of OA advocacy)

Neil Thakur started his talk by saying that he was not presenting official NIH policy, but rather a personal perspective. He pointed to the declining level of science funding in the US and argued that the response to this development could only be to work longer, work cheaper, or create value more efficiently, with the emphasis belonging on the last option. In Neil’s view this had also worked in the past: electronic publications are faster to find and easier to distribute than ever before in the history of scientific research. However, more papers and more information don’t necessarily mean more knowledge. Knowledge is still costly, both because of paywalls and because of the time that has to be spent on finding relevant information and on integrating it into one’s own research. Neil went on to describe the difficulty and costliness of planning large collaborative projects and the need to increase productivity by letting scientists incorporate papers into their thinking faster. He lamented that many relevant answers to pressing scientific questions (e.g. regarding cancer or climate change) are “buried in papers” and cited natural language processing (NLP), data mining and visual search as techniques that could help to extract more relevant findings from papers. He set a simple but ambitious goal: in 10 years’ time, a scientist should be able to incorporate 30% more papers into their thinking than today. So what kind of access is required for such approaches? Full and unrestricted access is necessary for summarizing content and analyzing the full text; otherwise the computer can’t mine anything and the improvements in efficiency described above fail to materialize. Neil made the excellent point that librarians are generally more concerned with how to disseminate scientific findings, whereas funders and scientists are interested in increasing scientific productivity. Libraries sometimes need to adjust to the notion that the university should ideally produce knowledge, and that knowledge takes on a variety of forms, not just that of peer-reviewed publications. Neil called this vision “all to all communication”, an approach that is ultimately about creating repositories of knowledge rather than repositories of papers. His characterization of “a machine as the first reader” of a paper really resonated with me for stressing the future importance of machine analysis of research results (something that of course applies to the sciences much more than to the social sciences and humanities). Neil further argued that fair use is a different goal than analysis by machine and that the huge variety of data formats and human access rights makes machine reading challenging. Yet the papers that one doesn’t include in one’s research (e.g. because they aren’t accessible) may be those which are crucial to the analysis. Neil also put on the table different ways of measuring scientific impact and quickly concluded that what we currently have (the Impact Factor) is insufficient, a criticism that seemed to resonate with the audience. Rather, new measurements should take into account the productivity and public impact of a publication, rather than citations or downloads. Finally, Neil concluded by describing various problems caused by licenses that restrict the re-use of material. Re-use is, among other things, extremely important to companies who seek to build products on openly available research results.
He ended by saying that “we’re funding science to make our economy stronger”, driving home the relevance of openness not just for access, but also for re-use.

Peter Binfield’s talk presented his employer (PLoS) and its success in developing a business model based on open access publishing. PLoS started modestly in 2000 and became an active publisher in 2003. Today it is one of the largest open access publishing houses in the world and the largest not-for-profit publisher based in the U.S. With headquarters in San Francisco, it has almost 120 employees. Peter noted that while PLoS’ old mission had been to “make scientific and medical literature freely available as a public resource”, its new mission is to “accelerate progress in science and medicine by leading a transformation in research communication”, broadening its direction from providing access to publications to being an enabler of scientific knowledge generation in a variety of technological ways. Peter stressed that PLoS consciously uses the CC-BY license to allow for full re-use possibilities. He described the author fees model that is financially the publisher’s main source of income (though there is also some income from ads, donations and membership fees) and noted that PLoS’ article fees have not risen since 2009. Fee waivers are granted on a regular basis, ensuring that the financial situation of authors does not prevent them from publishing. PLoS Biology (founded in 2003) and PLoS Medicine (2004) are the house’s oldest and most traditionally organized journals. They follow the model of Nature or Science, with their own full-time editorial staff, unique front matter and a very small number of rigorously selected papers (about 10 per month). Peter noted that the tradeoff of this approach is that while it produces excellent scientific content, it is also highly labor intensive and loses money as a result. The two journals were followed by PLoS Genetics, PLoS Computational Biology, PLoS Pathogens, and PLoS Neglected Tropical Diseases, the so-called PLoS Community Journals, launched between 2005 and 2007. These publications are run by part-time editorial boards of academics working at universities and research institutes rather than by PLoS employees. Only a relatively small administrative staff supports the community that edits, reviews and publishes submissions, which serves to increase the overall volume of publications. Finally, Peter spoke about PLoS ONE, a very important component of PLoS. While traditional journals have a scope of what is thematically suitable for publication in them, PLoS ONE’s only criterion is the validity of the scientific data and methods used. PLoS ONE publishes papers from a wide range of disciplines (life sciences, mathematics, computer science), asking only “is this work scientific?” rather than “is this work relevant to a specific readership?”. Discussions about relevance occur post-publication on the website, rather than pre-publication behind closed doors. Peter continued by stating that PLoS ONE seeks to “publish everything that is publishable” and that because of the great success of the service, PLoS had reached the point of being financially self-sustaining. By volume, PLoS ONE is now the largest “journal” in the world, growth that he also linked to the journal being assigned an Impact Factor (IF), an important prerequisite for researchers in many countries (e.g. China) who are effectively barred from publishing in journals without an impact factor, something that Peter wryly called “the impact of the impact factors on scientists”.
Peter gave the impressive statistic that in 2012, PLoS ONE will publish 1 in 60 of all science papers published worldwide and described a series of “clones”, i.e. journals following a similar concept that have been launched by major commercial publishers. Houses such as Springer and SAGE have started platforms with specific thematic foci that otherwise closely resemble PLoS ONE. Finally, Peter spoke about PLoS’ new initiatives: PLoS Currents, a service for publishing below-article-length content (figures, tables, etc.) with a focus on rapid dissemination; PLoS Hubs, which conducts post-review of Open Access content produced elsewhere and aggregates and enriches openly available results; and PLoS Blogs, a blogging platform (currently 15 active bloggers) used mainly for science communication and to educate the public. Peter closed by noting that the Impact Factor is a flawed metric because it is a journal-level measurement rather than an article-level indicator. He described the wider, more holistic approach taken by PLoS, which measures downloads, usage stats from a variety of services and social media indicators.

Pierre Mounier from Cléo presented OpenEdition, a French Open Access platform focused on the Humanities and Social Sciences and based on a Freemium business model. Cléo, the centre for electronic publishing, is a joint venture of multiple organizations that employs roughly 30 people. It currently runs revues.org (a publishing platform that hosts more than 300 journals and books), calenda.org (a calendar currently listing over 16,000 conference calls) and hypotheses.org (a scholarly blog platform with over 240 active bloggers). Pierre explained how Cléo re-examined the golden road open access model and found it to be problematic for their constituency. He regarded the subsidy model (no fees have to be paid; this is the model favored in Brazil) as very fragile, since support can run out suddenly. On the other hand, author fees potentially restrict the growth of a platform and have no tradition in the Humanities and Social Sciences, which may be a disincentive to authors. Pierre continued by asking what the role of libraries could be in the future. Cléo’s research highlighted that Open Access resources are used very rarely via libraries, while users searching at libraries use toll access (TA) resources more frequently. Interestingly enough, open access appears to mean that researchers (who know where to look) access publications more freely, while students tend to stick to what is made available to them via libraries. Because libraries are students’ point of access to scientific information, students make heavier use of toll access resources, for which the library acts as a gatekeeper. Pierre explained that the Freemium model Cléo developed based on this observation (a model also used by services like Zotero or Spotify) combines free (libre) and premium (pay) features. Access to HTML is free with OpenEdition, while PDF and epub formats are subscription-based and paid for by libraries. COUNTER statistics are also provided to subscribers. Pierre highlighted the different needs of the different communities involved in the academic publication process and noted that the Freemium model gives libraries a vital role, allowing them to continue to act as gatekeepers to some features of otherwise open scholarly content. Currently 20 publishers are using OpenEdition, with 38 research libraries subscribing and 1,000 books available.

Caroline Sutton spoke about “open access at the tipping point”, i.e. recent developments in the Open Access market. OASPA consists of a number of publishers, commercial and non-profit, e.g. BioMed Central, Co-Action Publishing, Copernicus, Hindawi, Journal of Medical Internet Research, Medical Education Online, PLoS, SAGE Publications, SPARC Europe and Utrecht University Library Publishers. The initial activism of OASPA was about dispelling fears about Open Access (Is it peer-reviewed? Is it based on serious research?). Caroline listed factors showing that the broad perception of Open Access has changed over the past few years. The new characterization is that Open Access is about the grand challenges of our time and an important prerequisite for economic growth. The discussion is about the finer points of how OA fits into academic publishing, rather than whether or not it should exist at all. Caroline noted that beyond gold vs. green road, there is now more talk of mixing and combining the two approaches. She pointed to a huge growth in OA publications over the last 2-3 years and noted that “everybody is getting into the game”, including commercial publishers such as Springer, SAGE and Wiley. So how necessary is an organization like OASPA if OA is so popular? As Caroline put it, “now we can roll up our sleeves and do different things” (e.g. educate legacy publishers and scholarly societies who lack the resources to successfully implement OA). Another area of OASPA’s activity is discussing what should count as an open access journal. Free access AND re-use are crucial according to Caroline, who noted that OASPA promotes the use of CC-BY across the board, although there are exceptions to this. It is now about making the point that re-use is interesting, about finding arguments that convince scholars and publishers of the advantages of data mining and aggregation services for which re-use is required. Licensing and technical standards are key in this respect. Caroline closed by noting the significance of the DOAJ and the development of new payment systems for OA article charges, which would make it easier for authors and publishers to utilize OA.


An interesting issue — especially to the library and information science community — that Google’s Max Senges raised at the Berlin Symposium on Internet and Society (#bsis11) was how the impact of the institute’s research could be measured. HIIG’s mission is not just to produce excellent scholarship, but also to foster a meaningful dialog with a wide range of stakeholders beyond academia in relation to the issues that the institute investigates.

This approach has a number of implications that I want to briefly address. My views are my own, but I consider this an exciting test case for a modern, digital form of science evaluation. I believe three things can serve to make the institute’s research as transparent as possible:

  1. primary research results (i.e. papers) should be Open Access,
  2. journalistic contributions (essays, interviews, public speaking) beyond academic publications should be encouraged,
  3. communication of research via social media (blogs, Twitter) should be encouraged.

Open Access is of key importance

David Drummond emphasized the importance of Open Access in his speech at the Institute’s inauguration. A plausible step to make Open Access part of the institute’s culture could be to sign the Berlin Declaration and set up a dedicated repository of institute publications. HIIG could encourage its researchers to publish in gold road Open Access journals such as those listed in the DOAJ and encourage use of a green road approach per the SHERPA/RoMEO list in the remaining cases. It could further encourage the use of Creative Commons or similar licences for scholarly publications.

Journalism and engagement with the general public

The public has a considerable interest in the issues investigated at HIIG, and accordingly talking with and through traditional media channels will be of great importance. This should not merely be considered a form of marketing, but rather a form of dialog that will allow HIIG to fulfill its obligation to the public to act as an informed voice in civic debate around issues such as privacy and net neutrality. Engagement with the public via essays, interviews, public speaking and similar activities should be considered part of the institute members’ impact.

Social media’s role for science communication

The institute could consider social media as a central avenue of engaging with a wider public and recognize the willingness to use it accordingly. Scholarly blogging, for example, should be considered as part of a member’s research output instead of being regarded as a chiefly private enterprise. Social media activity cannot supplant traditional scholarly publishing, but it can serve to conduct conversations around research, get the attention of non-academics, and point to formal publications, among other things.

So how could this be implemented? The first and second points — making primary research results available and promoting journalistic contributions — are already standard practice elsewhere. The third is a little more tricky. Should it be important how many friends a researcher has on Facebook, or followers on Twitter (assuming he/she is even on these platforms)? Such an approach would be much too simplistic, but perhaps something a little more nuanced could be tried. How about encouraging the use of the #hiig (hash)tag wherever possible and continuously tracking the results? The institute could run its own blog — this may or may not work well, given that many contributors might already have their own — or a blog planet, a site that simply aggregates #hiig-tagged material from existing blogs.
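
To make the tracking idea slightly more concrete, here is a minimal sketch of how #hiig mentions on Twitter could be collected periodically, using the twitteR package for R. This is purely an illustration, not a worked-out proposal; the search volume, the columns kept and the output file name are all my own assumptions:

# Illustrative sketch only: collect recent #hiig mentions with the twitteR package.
# Assumes twitteR is installed and (for current API versions) OAuth access has been configured.
library(twitteR)
mentions <- searchTwitter("#hiig", n = 200)       # fetch up to 200 recent tweets containing the tag
df <- twListToDF(mentions)                        # convert the status objects to a data frame
df <- df[, c("created", "screenName", "text")]    # keep only date, author and text
write.table(df, file = "hiig_mentions.tsv", sep = "\t",
            row.names = FALSE, col.names = FALSE, append = TRUE)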

These are just general ideas, but eventually they could coalesce into a framework for evaluating HIIG’s impact beyond purely scholarly (and faulty) forms of measurement such as the impact factor.

I’m currently relaxing at HIIG HQ, watching the staff make final preparations for the Institute’s formal inauguration, which will take place at 5pm today at Humboldt University’s Audimax (do drop by if you’re in the area, even if you haven’t been formally invited). I thought I’d share two statements on the launch of the Institute from Google, which were posted today and yesterday.

“Interaktion von Internet, Forschung und Gesellschaft verstehen” (in German; “Understanding the interaction of Internet, research and society”)
David Drummond, VP Google, in German newspaper DIE ZEIT

Launching an Internet & Society Research Institute
Max Senges, Google Policy, Google European Public Policy Blog


I’m heading to Berlin on an early morning train, among other things for next week’s Berlin Symposium on Internet and Society (#bsis11). The program is available here and should catch your attention if you work in Internet research. Be sure to give me a shout if you’re coming and want to have drinks some time.

The Symposium kicks off with the formal inauguration of the newly founded Alexander von Humboldt Institute for Internet and Society (HIIG). This is the official name of what has so far been referred to by most people in the field as “the Google Institute” since the plan to launch it was publicly announced by Eric Schmidt in February.

I’m involved with the Institute as a project associate, which means that I’ll be working on one specific issue (a platform dubbed Regulation Watch — more on that soon) for the next few months. I’m excited to be part of an inspiring and highly interdisciplinary team of people who are all studying the Internet’s effect on culture and society in one way or another, which is an especially exciting prospect if you’ve been more or less on your own with your research interest in this area for the better part of your career.

As I’ve been meaning to get back to blogging anyway, I’ve decided to post updates on what’s happening at HIIG (or “hig”, rhyming with “twig”, as I’ve decided to call my new employer in spoken English) on a semi-regular basis. Next week’s inauguration and symposium will be covered here with occasional short updates, news flashes and comments, as well as links to stuff other people have posted.

The institute’s mission is both to conduct research and to engage in an ongoing dialogue with the general public, an idea that is very much in accord with the vision of its patron. Alexander von Humboldt was a scientist, explorer, diplomat and, frankly, somewhat of a crazy person for trying things that most of his contemporaries considered both mad and futile. His research resonated with society and frequently stirred controversy. He challenged widely held beliefs about the world and was unwavering in his dedication to shedding light on scientific truth beyond superstition. I’m excited by HIIG’s aim to do something similar for our time’s uncharted continent — the net — and look forward to contributing to this goal.


Ahead of publishing my TwitterFunctions library of R code (which is a constant work in progress) I thought I’d put up some really short Python code for getting a person’s friends and followers. Both scripts rely on Tweepy, my favorite Python implementation of the Twitter API. Install Python (it works on Windows as well, not just on Mac/Linux), then Tweepy on top of that, and you are good to go with these two scripts, the first of which can be executed from the command line with
python get_friends.py username

# get_friends.py: print the screen names of all accounts a given user follows
import sys
import tweepy

user = sys.argv[1]
for friend in tweepy.api.friends(user):
	print friend.screen_name

# the followers version: print the screen names of all accounts following a given user
import sys
import tweepy

user = sys.argv[1]
for follower in tweepy.api.followers(user):
	print follower.screen_name
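
The second script works exactly the same way; assuming you save it under a name of your own choosing (say, get_followers.py, which is just my suggestion and not something the code prescribes), you would call it with
python get_followers.py username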

Here’s the announcement for a class that I’m teaching this winter (linked from the university’s e-teaching platform).

Doing A Research Project in English Linguistics: Computer-mediated Communication
Monday, 14.30-16.30, Room 23.21 00.44B (Forum des Forschungszentrums)
Thursday, 14.30-16.30, Building 23.21, Room U1.46

Note: The course will be held as a research-oriented block seminar on seven dates over the course of the semester. Active participation in a working group and documentation of the group work in ILIAS are mandatory requirements for participation.

For linguists studying language use, computer-mediated communication (CMC) plays an increasingly important role, both as a source of linguistic data and as an object of study in its own right. CMC historically encompasses a variety of electronic communications channels, such as SMS, paging, and pre-Internet forms of messaging and chat. Increasingly, however, the term identifies Internet-native formats of both Web 1.0 (email, IRC, instant messaging, discussion forums) and Web 2.0 (blogs, social networking sites, microblogging) through which users transmit both typed text and multimedia content.

Herring (2004) presents a concise framework for the analysis of CMC via the methodology of computer-mediated discourse analysis (CMDA). The objective of this class is for participants to form research teams of 2-4 students and conduct a small-scale study of CMC applying the CMDA framework that Herring outlines.

In order to give you the opportunity to realize your own project, the class differs from the usual format of weekly meetings. Rather than meeting once per week, the semester is divided into presence phases and research phases. Presence phases are used for discussion, practical training in applying the CMDA framework and the presentation of results, while research phases are at the disposal of the research teams for tasks such as gathering data, analyzing results and preparing presentations. After an introduction to the CMDA methodology, research teams will prepare and present their project proposals to the class, then work on the realization of their respective projects, and, after research has been conducted, present their findings at the end of the semester.

Prerequisites
In order to conduct a (small) CMDA research project under supervision, it is beneficial for students to have a solid foundation in linguistics (morphology, syntax, semantics, pragmatics and especially discourse analysis) as well as an interest in empirical research and communication on and through the Internet (both as an object of research and for communicating among each other during research phases). Technical skills (e.g. knowledge of HTML, basic programming) are not required, but will also be quite useful.

Registration via email
In addition to signing up via the HISLSF, please register for the class by sending an email to Cornelius.Puschmann@uni-duesseldorf.de before October 1st. The number of slots is limited to 30 participants.

BN requirements
- presentation of project proposal (mid-term) or final results (end of the semester)
- active participation during presence and research phases
- regular readings

AP requirements
- term paper

Reading
Herring, S. C. (2004). Computer-mediated discourse analysis: An approach to researching online behavior. In: S. A. Barab, R. Kling, and J. H. Gray (Eds.), Designing for Virtual Communities in the Service of Learning (pp. 338-376). New York: Cambridge University Press. Available online at: http://ella.slis.indiana.edu/~herring/cmda.pdf

Kouper, I. (2010). The pragmatics of peer advice in a LiveJournal community. Language@Internet, 7, article 1. Available online at: http://www.languageatinternet.de/articles/2010/2464/index_html/

Herring, S. C. (2010). Who’s got the floor in computer-mediated conversation? Edelsky’s gender patterns revisited. Language@Internet, 7, article 8. Available online at: http://www.languageatinternet.de/articles/2010/2857/index_html/

Herring, S. C. (In press, 2011). Grammar and electronic communication. In C. Chapelle (Ed.), Encyclopedia of applied linguistics. Hoboken, NJ: Wiley-Blackwell. Available online at: http://ella.slis.indiana.edu/~herring/e-grammar.2011.pdf


Unfortunately I’m not able to attend the annual IPrA conference next week in Manchester and had to cancel the trip at short notice. I was scheduled to give a talk, as part of the session Quoting in Computer-mediated Communication, on my work with Katrin Weller on retweeting among scientists.

Luckily for me, there will be a follow-up event of sorts (see below). I’ve posted the call here since it doesn’t seem to be available on the Web other than as a PDF. Submit something if you’re doing research on quoting! I’m fairly sure that the deadline will be extended by a week or two.

CfP: Quoting Now and Then – 3rd International Conference on Quotation and Meaning (ICQM)

University of Augsburg, Germany

19 April – 21 April 2012

Conference Convenors:
Wolfram Bublitz
Jenny Arendholz
Christian Hoffmann
Monika Kirner

Contact: Monika Kirner
E-mail: monika.kirner@phil.uni-augsburg.de

Call for Papers
This conference addresses the pragmatics of quoting as a metacommunicative act both in old (printed) and new (electronically mediated) communication. With the rapid evolution of new media in the last two decades, approaches to the study of (forms, functions and impact of) quoting have been gaining momentum in linguistics. Although quotations in print media have already been investigated to some extent, quoting in computer-mediated communication is still uncharted territory. This conference shall focus on the formal and functional evolution of quoting from old (analog) to new (digital) media. While the conference builds on the panel “Quoting in Computer-mediated Communication” to be presented in July 2011 at the International Conference of Pragmatics (IPrA), it assumes a much broader perspective, paying special tribute to the inherent confluence and complementarity of synchronic and diachronic approaches. Consequently, we invite papers from both (synchronic and diachronic) perspectives to report on the formal, functional as well as the pragmatic-discursive and multimodal nature of quoting in different genres or media.

Plenary talk: Jörg Meibauer

Abstracts:
Please submit an abstract of not more than 500 words (for a 30 min talk plus 10 min discussion) via e-mail to monika.kirner@phil.uni-augsburg.de

Deadline for abstracts:
15 August 2011 (extended from 1 July 2011)


I’ve been following the development of googleVis, the implementation of the Google Visualization API for R, for a bit now. The library has a lot of potential as a bridge between R (where data processing happens) and HTML (where presentation is [increasingly] happening). A growing number of visualization frameworks are on the market and all have their perks (e.g. Many Eyes, Simile, Flare). I guess I was so inspired by Hans Rosling’s TED talk, which makes such great use of bubble charts, that I wanted to try the Google Vis API for that chart type alone. There’s more, however, if you don’t care much for floating bubbles: neat chart variants include the geochart, area charts and the usual classics (bar, pie, etc). Check out the chart gallery for an overview.

So here are my internet growth charts:

(1) Motion chart showing the growth of the global internet population since 2000 for 208 countries

(2) World map showing global internet user statistics for 2009 for 208 countries

Data source: data.un.org (ITU database). I’ve merged two tables from the database into one (absolute numbers and percentages) and cleaned the data up a bit. The resulting tab-separated CSV file is available here.
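
If you want to redo the merge yourself, it boils down to a single join in R. The sketch below is only meant to illustrate the idea; the input file names and the exact column layout of the UN tables are assumptions on my part, not what the actual export looks like:

# Sketch of the data preparation step; file and column names are assumptions
abs <- read.csv("users_absolute.csv", sep="\t")   # absolute internet user numbers per country and year
pct <- read.csv("users_percent.csv", sep="\t")    # users as a percentage of the population
n <- merge(abs, pct, by=c("Country", "Year"))     # join the two tables on country and year
write.table(n, file="netstats.csv", sep="\t", row.names=FALSE)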

And here’s the R code for rendering the chart. Basically you just replace gvisMotionChart() with gvisGeoChart() for the second chart; the rest is the same.

library("googleVis")
n <- read.csv("netstats.csv", sep="\t")
nmotion <- gvisMotionChart(n, idvar="Country", timevar="Year", options=list(width=1024, height=768))
plot(nmotion)
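
Just to spell the swap out for the second chart (the world map): a rough version could look like the snippet below. Filtering to 2009 and the name of the user-count column ("Users" here) are assumptions on my part, not the exact call behind the map linked above.

# sketch of the geochart variant; "Users" stands in for whatever the user-count column is called
n2009 <- subset(n, Year == 2009)
ngeo <- gvisGeoChart(n2009, locationvar="Country", colorvar="Users", options=list(width=1024, height=768))
plot(ngeo)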

I meant to post this a month or so ago, when I was conducting my study of casual tweeting, but didn’t get to it. No harm in posting it now, I guess — code doesn’t go bad, fortunately.

Note: this requires Linux/Unix/OSX, Python 2.6 and the tweepy library. It might also work on Windows, but I haven’t checked.

1. Fetching a single user’s tweets with twitter_fetch.py

The purpose of the script below is to automatically retrieve all new tweets by one or more users, where “new” means all tweets that have been added since the last round of archiving. If the script is called for the first time for a given user, it will try to retrieve all available tweets for that person. It relies on the tweepy package for Python, which is one of a number of libraries providing access to the Twitter API. In case you’re looking for a library for R, check out twitteR.

import sys
import time
import os
import tweepy
 
# make sure that the directory 'Tweets' exists, this is
# where the tweets will be archived
wdir = 'Tweets'
user = sys.argv[1]
id_file = user + '.last_id'
timeline_file = user + '.timeline'
 
if os.path.exists(wdir + '/' + id_file):
	f = open(wdir + '/' + id_file, 'r')
	since = int(f.read())
	f.close()
	tweets = tweepy.api.user_timeline(user, since_id=since)
else:
	tweets = tweepy.api.user_timeline(user)
 
if len(tweets) > 0:
	last_id = str(tweets[0].id)
	tweets.reverse()
 
	# write tweets to file
	f = open(wdir + '/' + timeline_file, 'a+')
	for tweet in tweets:
		# strip line breaks so that each tweet stays on a single tab-separated line
		output = str(tweet.created_at) + '\t' + tweet.text.replace('\r', ' ').replace('\n', ' ').encode('utf-8') + '\t' + tweet.source.encode('utf-8') + '\n'
		f.write(output)
		print output
	f.close()
 
	# write last id to file
	f = open(wdir + '/' + id_file, 'w')
	f.write(last_id)
	f.close()
else:
	print 'No new tweets for ' + user

The code is pretty straightforward. I wrote it without really knowing Python beyond the bare essentials and relying heavily on IPython’s code completion. Actual retrieval of tweets happens in a single line:

tweets = tweepy.api.user_timeline(user)
 

The rest of the script is devoted to managing the data and making sure only new tweets are retrieved. This is done via the since_id parameter, which is fed the last recorded id saved to the user’s id file in the previous round of archiving. There are more elegant ways of doing this, but any improvements are up to you. ;-)
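
As an aside, if you would rather stay in R with the twitteR package mentioned above, the core retrieval step has a rough equivalent there as well. The sketch below leaves out the since_id bookkeeping (twitteR’s userTimeline accepts a sinceID argument for that), and the account name and choice of columns are simply placeholders mirroring the Python output:

# Rough R equivalent of the retrieval step using twitteR (sketch only, no since_id bookkeeping)
library(twitteR)
tweets <- userTimeline("SomeUser", n = 200)              # fetch up to 200 recent tweets for the account
df <- twListToDF(tweets)                                 # convert to a data frame
write.table(df[, c("created", "text", "statusSource")],  # date, text and client, as in the Python script
            file = "SomeUser.timeline", sep = "\t",
            row.names = FALSE, col.names = FALSE, append = TRUE)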

2. Fetching a bunch of different users’ tweets with twitter_fetch_all.sh

Second comes a very simple bash script. The only thing it does is call twitter_fetch.py once for each user in a list of people you want to track. Again, there are probably other ways of doing this, but I wanted to keep the functions of the two different scripts separate.

#!/bin/bash
# This will run twitter_fetch.py for every account in the twitter_users[] array.
# Add any number of twitter_users[NUMBER]="USER" lines below to archive additional accounts.
 
# --- twitter user list ---
twitter_users[0]="SomeUser"
twitter_users[1]="SomeOtherUser"
twitter_users[2]="YetAnotherUser"
twitter_users[3]="YouGetTheIdea"
 
# --- execute twitter_fetch.py ---
for twitter_user in ${twitter_users[*]}
do
	echo "Getting tweets for user $twitter_user"
	python twitter_fetch.py $twitter_user
	echo "Done."
	echo ""
done

You should place this in the same directory as twitter_fetch.py and modify it to suit your needs.

3. Automating the whole thing with a cronjob

Finally, here’s a cron directive I used to automate the process and log the result in case any errors occur. Read the linked Wikipedia article if you’re unfamiliar with cron; it’s a very convenient way of automating tasks on Linux/Unix.

0 * * * * sh /root/twitter_fetch_all.sh >/root/twitter_fetch.log 2>&1
 

(Yes, I’m running this as root. Because I can. And because it’s an EC2 instance with nothing else on it anyway.)

Hope it’s useful to someone, let me know if you have any questions. :-)
