Edit: A summary of the recent Open Science session at the Berlin Colloquium on Internet and Society with talks from Constanze Engelbrecht, Pasco Bilic and Christoph Lutz has been posted by the Humboldt Institute’s Benedikt Fecher (German version, English version). The text below is a more general discussion of how Open Science can be defined.
One area of research at the Alexander von Humboldt Institute is Open Science, an emerging term used to describe new ways of conducting research and communicating its results through the Internet. There is no single definition of what constitutes Open Science (and one could argue there doesn’t really need to be), but in this blog entry I want to point to attempts to define the term by prominent scientists and activists, and discuss some of the limitations of these definitions. I’ll summarize my observations in the form of five questions that suggest a direction that future research into Open Science could take.
Open Science: a few working definitions
Michael Nielsen – a prominent scientist and author on Open Science whose name pops up invariably when discussing the topic – provides this very comprehensive definition in a post to the Open Science mailing list:
“Open science is the idea that scientific knowledge of all kinds should be openly shared as early as is practical in the discovery process.”
In the same vein, Peter Murray-Rust, a professor in molecular chemistry and Open Access advocate, provides another definition (also through the OKFN’s open science mailing list):
“In a full open science process the major part of the research would have been posted openly and would potentially have been available to people outside the research group both for reading and comment.”
(Also see this interview if you want a more detailed exposition).
Finally, Jean Claude Bradley, also a professor in chemistry, provides a definition of what he calls Open Notebook Science, a very similar approach:
“[In Open Notebook Science] there is a URL to a laboratory notebook that is freely available and indexed on common search engines. It does not necessarily have to look like a paper notebook but it is essential that all of the information available to the researchers to make their conclusions is equally available to the rest of the world.”
(Here’s a presentation summarizing his approach, Open Notebook Science. A similar view is articulated by M. Fabiana Kubke and Daniel Mietchen in this video, though they prefer the term Open Research.)
From natural philosophy to science
One thing that these different definitions have in common is the way in which they frame science. In English, the word science has come to denote primarily the natural sciences (traditionally physics and chemistry, more recently also biology and life sciences). The history of the term is long and complex (check out the Wikipedia entry), but as a result of language change, a wide range of disciplines are considered not to be part of the sciences, but instead belong to the social sciences and Humanities.
Why does this matter? The above definitions are very closely tailored to the methods and organizational structures of the natural sciences. They assume that research is conducted in a research group (Murray-Rust) that works primarily in a laboratory and whose members record the steps of an experimental process in a lab notebook (Bradley), following a sequence of more or less clearly-structured steps that can be summarized as “the discovery process” (Nielsen).
Research processes in other fields differ strongly from this approach, not just in the Humanities (where there is frequently no research group, and data is of varying relevance), but also in the social sciences (where there is generally no laboratory, and data frequently comes from human subjects rather than technical instruments such as radio telescopes or DNA sequencers). These fields do not just use different tools; instruments also shape their users’ assumptions about the world and about what they do. Sociologist Karin Knorr-Cetina points to this difference in the title of her book Epistemic Cultures, and similar observations have been made in Bruno Latour & Steve Woolgar’s Laboratory Life: The Construction of Scientific Facts. One crucial aspect of this is how data is conceptualized in the different disciplinary perspectives and, related to this, how notions differ regarding what openness means.
Openness beyond open access to publications
Openness can be defined in a variety of ways. Not all information that is available online is open in a technical sense – just think about proprietary file formats that make it difficult to share and re-use data. Technical openness does not equal legal openness, a problem that is also on the institute’s agenda.
Open Access – the technical and legal accessibility of scholarly publications via the Internet – is widely regarded to benefit both science and the public at large. In the traditional publishing model access to research results in scholarly monographs and journals is available to subscribers only (usually institutional subscribers, in other words, libraries). The Open Access model shifts the costs, sometimes to authors (who pay a fee to publish) or to publishing funds or other institutional actors. The Budapest and Berlin Declarations on Open Access specify under which provisions publications are truly Open Access, rather than just somehow accessible. Open Access has a range of benefits, from reducing costs and providing access to scientists at small universities and in developing countries, to increasing transparency and raising scholarly impact. Models based on author fees, such as the one utilized by PLoS, are increasingly common and make Open Access economically feasible.
There is broad consensus that Open Access is a first step, but that it’s not enough. Many scientists, such as the ones cited above, call for research data also to be made available more broadly. Sharing research data, instead of packaging data and analysis together in scholarly articles, could enable new forms of research that complement current practices, which tend to emphasize positive outcomes (experiments that worked) over negative ones (those that didn’t), even though negative outcomes can greatly contribute to a better understanding of a problem.
Making openness count
The barriers to achieving a more open environment with regard to research data aren’t primarily technical or legal, but cultural. Research has always been based on the open dissemination of knowledge (just take the history of the Philosophical Transactions, widely considered the oldest scientific journal), but it is also very closely tied to the formats in which knowledge is stored and disseminated, such as books, journal articles, and conference papers, which tend to take on a valorizing role rather than being just arbitrary containers of scholarly information. Many scholars, regardless of their field, see themselves in the business of publishing books, articles, and papers just as much as they consider themselves to be in the business of doing research. While the technology behind scholarly publishing has changed dramatically, the concepts have not. Because institutionalized academia is incentive-driven and highly competitive, collective goals (a more efficient approach to knowledge production) are trumped by individual ones (more highly-ranked publications = more funding and promotions for the individual researcher).
Institutional academia is no longer the only place where research happens. Increasingly, there is (if latently) competition from crowdsourcing platforms that facilitate collaborative knowledge creation (and, more broadly, problem solving) outside of institutional contexts. Depending on how you define the process of knowledge production, examples include both Wikipedia and projects such as the #SciFund Challenge. The approach to knowledge production in these environments currently seems to focus on knowledge recombination and remixing, but it appears plausible that more sophisticated models could arise in the future. Whether these hybrid communities of knowledge production have the potential to displace established institutional academia remains to be seen; more likely, such communities will blossom in those areas where traditional academia fails to deliver.
But even inside institutional academia, the time seems ripe for more openness beyond making publications and data available to other academics. Social media makes it possible for scholars to both communicate with their peers and engage with the public more directly — though they are still hesitant to do either at the moment. Public visibility is not as high on the agenda of most researchers as one might expect, because academic success is largely the result of peer, not popular evaluation.
Redefining scholarly impact
This may change as new, more open measurements of scholarly impact enter the mainstream. Measuring and evaluating the impact and quality of publicly-funded research has been a key political interest for decades. While frameworks exist for conducting large and complex evaluations (Research Assessment Exercises in the UK, Exzellenzinitiative in Germany), the metrics used to evaluate the performance of researchers are generally criticized as too one-dimensional. This criticism applies in particular to measurements that indicate the quality of publications, such as Thomson Reuters’ Impact Factor (IF). A confluence of measures (downloads, views, incoming links) could change the current, extremely one-sided approach to evaluation and make it more holistic, generating a more nuanced picture of scholarly performance.
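To make the idea of a “confluence of measures” concrete, here is a minimal sketch of how several usage signals could be folded into one composite score. Everything in it – the weights, the normalization ceilings, the field names – is invented for illustration; real altmetrics services define their own, far more sophisticated models.

```python
# Hypothetical sketch of a composite impact score. Weights, ceilings
# and signal names are all assumptions made for illustration only.

def composite_score(metrics, weights=None):
    """Weighted sum of usage signals, each normalized to [0, 1]."""
    if weights is None:
        weights = {"downloads": 0.4, "views": 0.3, "incoming_links": 0.3}
    # Assumed ceilings against which each raw signal is normalized.
    ceilings = {"downloads": 10_000, "views": 50_000, "incoming_links": 500}
    score = 0.0
    for name, weight in weights.items():
        # Missing signals count as zero; values are capped at the ceiling.
        value = min(metrics.get(name, 0) / ceilings[name], 1.0)
        score += weight * value
    return round(score, 3)

article = {"downloads": 2_500, "views": 40_000, "incoming_links": 50}
print(composite_score(article))  # → 0.37
```

Even this toy model shows the appeal of the approach: an article with modest downloads but heavy readership and linking scores differently than one that excels on a single dimension, which a lone metric like the IF cannot capture.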
Questions for research into Open Science
The following questions reflect some of the issues raised by “open” approaches to science and scholarship. They are by no means the only ones, as the Open Science project description on the pages of the HIIG highlights, but reflect my personal take.
- How can Open Science be conceptualized in ways that reach beyond the paradigm of the natural sciences? In other words, what should Open Humanities and Open Social Sciences look like?
- How do different types of data (recorded by machines, created by human subjects, classified and categorized by experts) and the diverse methods used for interacting with them (close reading, qualitative analysis, hermeneutics, statistical approaches, data mining, machine learning) impact knowledge creation, and what are their respective potentials for openness in the sense described by Nielsen, Murray-Rust and Bradley? What are the limits to openness, e.g. for ethical, economic and political reasons?
- What are features of academic openness beyond open access (e.g. availability of data, talks, teaching materials, social media presence, public outreach activities) and how do they apply differently to different disciplines?
- How can the above-mentioned features be used for a faceted, holistic evaluation of scholarly impact that goes beyond a single metric (in other words, that measures visibility, transparency and participation in both scientific and public contexts)?
- What is the relationship between institutionalized academia and hybrid virtual communities and platforms? Are they competitive or complementary? How do their approaches to knowledge production and the incentives they offer to the individual differ?
I’ve blogged the following entries covering the Berlin 9 Open Access Conference:
- Berlin 9: Pre-conference session on Open Access Publishing (Tuesday)
- Berlin 9: Pre-conference session on Open Access Policy (Tuesday)
- Berlin 9: Opening session (Wednesday)
- Berlin 9: The Worldwide Policy Environment (Wednesday)
And here’s a rough list of blog posts and news items from other sources that I’ve come across:
Avice Meehan moderated the first session of the Berlin 9 Open Access conference session on The Worldwide Policy Environment. She introduced the three presenters:
- Jean-François Dechamp, Policy Officer, European Commission, Directorate-General for Research and Innovation
- Harold Varmus, Director, U.S. National Cancer Institute
- Cyril Muller, Vice President, External Affairs Department, The World Bank
After a brief introduction by Avice, Jean-François Dechamp took to the podium to talk about the European policy context of Open Access. Jean-François described how the European Commission acts as a policy maker, a funding agency, and as an infrastructure funder and capacity builder. He cited Commission documents stating that “publicly funded research should be open access” and noted that the Commission aims to make Open Access to publications “the general principle for projects funded by the EU research Framework Programmes”. Key reasons for the European Commission to support Open Access include serving science and research, benefiting innovation, and improving return on investment in R&D. OA publishing costs (article charges) are covered by FP7, although fairly few researchers realize this. Dechamp cited a study conducted by the EUC in which the majority of researchers involved indicated that they were ready to self-archive, but that the legal challenges were daunting. He cited a soon-to-be-released study (ERAC, 2010-2011) that found that the overall significance of OA in the member states has significantly increased over the past few years.
Harold Varmus of the U.S. National Cancer Institute and NIH came next. Harold stressed that he was not speaking as the representative of a policy-making institution, but as a scientist. He lamented that the shift towards OA is not happening fast enough and called for a broader idea of Open Access that must go beyond access to publications, to access to data and (ultimately) knowledge. True Open Access, according to Harold, means gold road OA, in accord with the Berlin Declaration; embargoes aren’t good enough. Harold traced his contact with OA to 1998, when he heard about arXiv (built by Paul Ginsparg) and thought that such a resource should also exist for biomedicine. He went on to emphasize that different fields have different needs, and that publishing must be sensitive to these needs. Harold also stressed the success of PubMed Central, which now holds 2 million articles. In 2006 publishers were encouraged to donate articles (with limited success); in 2008 a mandate was introduced to publish NIH-supported research on PubMed Central after an embargo period. Harold noted that economics are essential and that there’s always a business plan attached to journals. He noted that while researchers love their publishers, they love the people who give them money even more, pointing to the central influence of funders in relation to OA. Harold noted the success of PLoS, specifically of PLoS ONE. He further echoed Cathy Norton’s observation that the public at large wants access: not just abstracts and titles, but the actual data. While articles are the best product of academic research, they are also emotionally laden. Harold noted that while funders see articles as mere vehicles of knowledge, authors also write for fame and prestige, not just to contribute to knowledge. He closed by arguing strongly for a new regime of review (post rather than pre).
Authors, he argued, should be required to list their most important contributions, rather than relying on bean counting via long publication lists and the impact factor.
Cyril Muller approached the topic differently in his talk, focusing on the Open Data approach of his institution, the World Bank, and on the positive effects it had observed in making the data it collects digitally available. He described the three pillars of the Bank’s approach (Open Data, Open Knowledge, Open Solutions) and presented statistics on how much information is now made available online, rather than in print, via its Open Knowledge Repository. He provided interesting examples of information-enabled innovation in Africa and elsewhere. My notes are unfortunately somewhat incomplete on Cyril’s talk, but it really focused on Open (Government) Data more than on Open Access (to scholarly publications), putting it more into a thematic camp with a variety of initiatives from that direction.
The conference opened with welcoming remarks, first from the HHMI’s VP and chief scientific officer Jack E. Dixon, then from HHMI’s head Robert Tjian, followed by the Max Planck Society’s Bernard Schutz, and finally from the Marine Biology Lab’s Cathy Norton. Jack Dixon struck an optimistic note, observing that “the tide is turning, in a very positive way.” Robert Tjian observed that those who fund research should be more active in publishing, a reference to eLife, a new Open Access journal in the life sciences jointly launched by HHMI and the Max Planck Society. He went on to note that “scientific work is not complete before the results become accessible… what we do doesn’t have any impact otherwise.” Bernard Schutz focused on the development of the Berlin Declaration in his talk. Thirty institutions were original signatories in 2003 when the Declaration was first drafted; 338 institutions are now among the signatories. A global expansion of the Berlin meetings from Europe (Berlin 1 to Berlin 7) to the world (Berlin 8 in China, Berlin 9 in the U.S.) had been vital, because “research and publishing are global issues”. Bernard noted that much had been achieved in relation to green road OA and repositories, but that the Max Planck Society regards the popularization of gold road open access as an important goal for the future. He went on to note that interdisciplinarity and innovation (e.g. in business) are enabled by OA. Free information is a common good, and the spread of knowledge to stakeholders outside academia (teachers and students) is enabled by OA. Bernard observed that to many publishers “the business model is less important than the business itself” and that many publishers would transition to OA if viable business models could be established. He described disagreements between publishers, institutions, and researchers in some areas and stressed that the Max Planck Society is ready to work with all stakeholders on the issues at hand.
Finally, he stated “we want to become more inclusive” and characterized Open Access as part of a larger movement towards (more) Free Information.
Cathy Norton from the Marine Biology Lab focused on issues close to her field in her talk. She discussed the success of MedLine and pointed out how interested the public is in certain areas of scientific information. The future of medicine, according to Cathy, lies in personalization of drugs and treatments, something that can only be achieved by having large volumes of data freely available. Techniques such as text mining and visual search are key to utilizing such new approaches, as are efforts such as semantic MedLine that map ontological relationships in large volumes of text. Cathy closed by noting the importance of citizen engagement, e.g. in relation to biodiversity data (95% of the publications on biodiversity are from North America and Europe, while the species described are virtually all found in Africa and South America).
The session closed with a question from Stuart Shieber, who wondered how the Max Planck Society intends to support creating an environment that allows publishers to transition to Open Access, a possibility Bernard Schutz had hinted at. Bernard replied that there were ongoing conversations between publishers and the MPS on these issues.
This is my second report from the Berlin 9 Open Access Conference, this one summarizing Tuesday’s session on Open Access Policy. I’m still catching up on yesterday’s talks and will post those later today or early tomorrow.
The session was moderated by Alma Swan of Enabling Open Scholarship, also director of Key Perspectives Ltd. Alma introduced the panelists:
- Bernard Rentier, Rector, Université de Liege
- Stuart Shieber, Director, Office for Scholarly Communication, Harvard University
- William Nixon, Digital Library Development Manager, University of Glasgow
- Jeffrey Vitter, Provost and Executive Vice Chancellor, University of Kansas
After this, Alma laid out some of the key issues on which the presenters would focus in their talks: the precise wording of the institutional open access policy they had put into place, the people involved in planning and implementing it, the nature of the implementation, and finally the resources for ongoing support (as she pointed out, if there is no ongoing support, open access does not work). Alma then proposed an elaborate typology of policies based on multiple factors, i.e. who retains rights, whether or not there is a waiver, when deposit takes place, and whether or not there is an embargo on the full text or the article meta-data. I’m hoping to be able to include Alma’s slides here later; they contain a very nice table that describes these points.
Bernard Rentier from the Université de Liege in Belgium was the first presenter and gave a very engaging talk. He started with the analogy that a university that doesn’t know what it is publishing is like a factory that doesn’t know what it’s producing. The initial motivation at Liege was to create an inventory of what was being published there. Scholars wanted to be able to extract lists of their publications easily and be more visible to search engines. Bernard went on to describe what he called the Liege approach of carrot and stick and summarized this by saying “If you don’t have a mandate, nothing happens. If you have a mandate and don’t enforce it, nothing happens.” Having a mandate to deposit articles, the enforcement of this mandate, the quality of service provided, and the incentives and sanctions in place are all vital. Bernard then described ORBi, the university’s repository. ORBi has 68,000 records and 41,000 full texts (50%), all uploaded by the researchers themselves. Most of the papers that are not available in full text were published before 2002. Papers that have a record in the repository are cited twice as often as papers by Liege authors that do not, something Bernard attributed to their strongly improved findability. Not all full texts in ORBi are Open Access: roughly half of the texts are embargoed, waiting to be made available after the embargo has been lifted. Bernard explained that 20% of what is published in ORBi constitutes what is often called grey literature (reports, unpublished manuscripts), which is now much more visible than before. He noted that ORBi had been marketed as “not just another tool for librarians”; rather, his goal had been to involve the entire faculty, something that was also furthered by making the report produced by ORBi the sole document relevant in all performance reviews (e.g. for promotions and tenure).
ORBi is linked to Liege’s digital university phone book, tying it to the general identity information that people might search for. It is also promoted aggressively on the university website, rather than being hidden away on the pages of the library. Bernard closed by saying that today ORBi attracts an impressive 1,100 article downloads per day and that plans are underway to use the system at the University of Luxembourg, the Czech Academy of Sciences and other institutions.
Stuart Shieber followed with a talk on the development of the Harvard Open Access mandate, introduced in 2008. Since its original introduction, a total of eight Harvard schools have joined the agreement, which generally mandates use of the institutional repository for publications (there is a waiver). Stuart described how first preparations began in 2006 and how there was much discussion in the academic senate. The FAS faculty voted in February 2008 and unanimously accepted the new policy. Stuart outlined its structure as follows:
- permission: the author grants the university rights
- waiver: if you want a waiver, you get a waiver
- deposit: deposit is mandated upon publication, and everything is deposited, including material under embargo
This creates a structure in which authors retain a maximum of control over their publications, yet generally deposit what they publish in the university’s repository. Stuart closed by saying (in reference to Bernard) “We’re not trying to apply a stick, we’re trying to apply a carrot” (e.g. statistics for authors on their article use and other incentives).
Next up was William Nixon of the University of Glasgow, who presented their repository, Enlighten. William started by saying that he wasn’t wild about the terms “mandate” and “repository”, but that they had sought to communicate the usefulness of Enlighten to authors, winning them over rather than forcing them to use the service. He described the wide integration of Enlighten with other services and cited a statistic showing that 80% of traffic to the repository comes from Google. William then gave a historical account of their approach. After Enlighten launched in 2006 with its use by authors “strongly encouraged”, virtually nothing happened. In 2007 a student thesis mandate was introduced, making it a requirement for all theses to be deposited. In 2008, all faculty publications “where copyright permits” were included. In 2010, the report generated by Enlighten was made a key element of the overall research assessment, an important step mirroring the strategy used in Liege. William also discussed staff concerns: What content must be provided? Am I breaking copyright law by using the repository? How and by whom will the publication be seen and accessed online? What version (repository vs. publisher) of my publication will be cited? William closed by giving a brief account of the repository’s performance record: 14,000 new records had been added in 2010 alone, a rapid growth.
The University of Kansas’ provost and executive vice chancellor, Jeffrey Vitter, gave a historical account of how KU ScholarWorks, the university’s repository, had been gradually developed and introduced, and pointed to the importance of the advocacy of organizations such as the ARL, which had promoted the idea behind IRs and Open Access for many years, making it easier to popularize the idea among the faculty. I apologize for not having an in-depth account of Jeffrey’s talk, but at this point jet lag caught up with me. If you have any notes to contribute for this or any other part of the session, please share.
In the Q&A that followed the presentations what stuck with me was Bernard Rentier’s response to the question of an Elsevier representative about whether collaboration with publishers was not paramount for the success of an open access policy. Bernard emphatically described the difficulties he had experienced in the past when negotiating with major publishers and made clear that while he was open to collaboration a sign of trust would be in order first.
This is my first post reporting from the Berlin 9 Open Access Conference taking place in Bethesda this week. I’ll be reporting and summarizing as thoroughly as I can starting with two pre-conference sessions that took place yesterday.
Note: I’ll include the presenters’ slides here if I can somehow get my hands on them. Stay tuned.
Christoph Bruch of the Max Planck Digital Library (MPDL) opened the first pre-conference session on Open Access Publishing by introducing the four presenters:
- Neil Thakur, NIH (perspective of funders and government)
- Peter Binfield, PLoS ONE (perspective of an OA publisher)
- Pierre Mournier, Cléo/OpenEdition.org (alternate approach to gold/green OA)
- Caroline Sutton, OAP Association & Co-Action Publishing (perspective of OA advocacy)
Neil Thakur started his talk by saying that he was not presenting official NIH policy, but rather a personal perspective. He pointed to the declining level of science funding in the US and argued that the response to this development could only be to work longer, work cheaper, or create value more efficiently, with the emphasis belonging on the last option. In Neil’s view this had also worked in the past: electronic publications are faster to find and easier to distribute than anything before in the history of scientific research. However, more papers and more information don’t necessarily mean more knowledge. Knowledge is still costly, both because of paywalls and because of the time that has to be spent on finding relevant information and integrating it into one’s own research. Neil went on to describe the difficulty and costliness of planning large collaborative projects and the need to increase productivity by letting scientists incorporate papers into their thinking faster. He lamented that many relevant answers to pressing scientific questions (e.g. regarding cancer or climate change) are “buried in papers” and cited natural language processing (NLP), data mining and visual search as techniques that could help extract more relevant findings from papers. He set a simple but ambitious goal: in 10 years’ time, a scientist should be able to incorporate 30% more papers into their thinking than today. So what kind of access is required for such approaches? Full and unrestricted access is necessary for summarizing content and analyzing the full text; otherwise the computer can’t mine anything and the described improvements in efficiency fail to materialize. Neil made the excellent point that librarians are generally more concerned with how to disseminate scientific findings, whereas funders and scientists are interested in increasing scientific productivity.
Libraries sometimes need to adjust to the notion that the university should ideally produce knowledge, and that knowledge takes on a variety of forms, not just that of peer-reviewed publications. Neil called this vision “all to all communication”, an approach that is ultimately about creating repositories of knowledge rather than repositories of papers. His characterization of “a machine as the first reader” of a paper really resonated with me for stressing the future importance of machine analysis of research results (something that of course applies to the natural sciences much more than to the social sciences and humanities). Neil further argued that fair use is a different goal than analysis by machine and that the huge variety of data formats and human access rights makes machine reading challenging. Yet the papers that one doesn’t include in one’s research (e.g. because they aren’t accessible) may be those that are crucial to the analysis. Neil also put different ways of measuring scientific impact on the table and quickly concluded that what we currently have (the Impact Factor) is insufficient, a criticism that seemed to resonate with the audience. New measurements should instead take into account the productivity and public impact of a publication, rather than citations or downloads. Finally, Neil concluded by describing various problems caused by licenses that restrict the re-use of material. Re-use is, among other things, extremely important to companies who seek to build products on openly available research results. He ended by saying that “we’re funding science to make our economy stronger”, driving home the relevance of openness not just for access, but also for re-use.
Peter Binfield’s talk presented his employer (PLoS) and its success in developing a business model based on open access publishing. PLoS started modestly in 2000 and became an active publisher in 2003. Today it is one of the largest open access publishing houses in the world and the largest not-for-profit publisher based in the U.S. Headquartered in San Francisco, it has almost 120 employees. Peter noted that while PLoS’ old mission had been to “make scientific and medical literature freely available as a public resource”, its new mission is to “accelerate progress in science and medicine by leading a transformation in research communication”, broadening its direction from providing access to publications to being an enabler of scientific knowledge generation in a variety of technological ways. Peter stressed that PLoS consciously uses the CC-BY license to allow for full re-use possibilities. He described the author fees model that is financially the publisher’s main source of income (though there is also some income from ads, donations and membership fees) and noted that PLoS’ article fees have not risen since 2009. Fee waivers are given on a regular basis, ensuring that an author’s financial situation does not prevent him or her from publishing. PLoS Biology (founded in 2003) and PLoS Medicine (2004) are the house’s oldest and most traditionally organized journals. They follow the model of Nature or Science, with their own full-time editorial staff, unique front matter and a very small number of rigorously selected papers (about 10 per month). Peter noted that the tradeoff of this approach is that while it produces excellent scientific content, it is also highly labor intensive and makes a loss as a result. The two journals were followed by PLoS Genetics, PLoS Computational Biology, PLoS Pathogens, and PLoS Neglected Tropical Diseases, the so-called PLoS Community Journals, launched between 2005 and 2007.
These publications are run by part-time editorial boards of academics working at universities and research institutes rather than by PLoS employees. Only a relatively small administrative staff supports the community that edits, reviews and publishes submissions, which serves to increase the overall volume of publications. Finally, Peter spoke about PLoS ONE, a very important component of PLoS. While traditional journals have a scope of what is thematically suitable for publication in them, PLoS ONE’s only criterion is the validity of the scientific data and methods used. PLoS ONE publishes papers from a wide range of disciplines (life sciences, mathematics, computer science), asking only “is this work scientific?” rather than “is this work relevant to a specific readership?”. Discussions about relevance occur post-publication on the website, rather than pre-publication behind closed doors. Peter continued by stating that PLoS ONE seeks to “publish everything that is publishable” and that because of the great success of the service, PLoS had reached the point of being financially self-sustaining. By volume, PLoS ONE is now the largest “journal” in the world, growth that he also linked to the introduction of the Impact Factor (IF) to rank the journal, an important prerequisite for researchers in many countries (e.g. China) who are effectively banned from publishing in journals without an impact factor, something that Peter wryly called “the impact of the impact factor on scientists”. Peter gave the impressive statistic that in 2012, PLoS ONE will publish 1 in 60 of all science papers published worldwide and described a series of “clones”, i.e. journals following a similar concept launched by major commercial publishers. Houses such as Springer and SAGE have started platforms with specific thematic foci that otherwise closely resemble PLoS ONE.
Finally, Peter spoke about PLoS’ new initiatives: PLoS Currents, a service for publishing below-article-length content (figures, tables etc.) that focuses on rapid dissemination; PLoS Hubs, where post-review of Open Access content produced elsewhere is conducted and which aggregates and enriches openly available results; and PLoS Blogs, a blogging platform (currently 15 active bloggers) used mainly for science communication and to educate the public. Peter closed by noting that the Impact Factor is a flawed metric due to being a journal-level measurement rather than an article-level indicator. He described the wider, more holistic approach taken by PLoS, which measures downloads, usage stats from a variety of services and social media indicators.
Pierre Mounier from Cléo presented OpenEdition, a French Open Access platform focused on the Humanities and Social Sciences and based on a Freemium business model. Cléo, the center for electronic publishing, is a joint venture of multiple organizations that employs roughly 30 people. It currently runs revues.org (a publishing platform that hosts more than 300 journals and books), calenda.org (a calendar of currently over 16,000 conference calls) and hypotheses.org (a scholarly blog platform with over 240 active bloggers). Pierre explained how Cléo re-examined the golden road open access model and found it to be problematic for their constituency. He regarded the subsidy model (no fees have to be paid — the model favored in Brazil) as very fragile, since support can run out suddenly. On the other hand, author fees potentially restrict the growth of a platform and have no tradition in the Humanities and Social Sciences, which may be a disincentive to authors. Pierre continued by asking what the role of libraries could be in the future. Cléo’s research highlighted that Open Access resources are rarely used via libraries, while users searching at libraries use toll access (TA) resources more frequently. Interestingly, open access appears to mean that researchers (who know where to look) access publications more freely, but students tend to stick to what is made available to them via libraries. Because libraries are the point of access to scientific information for students, students use toll access resources more, with the library acting as gatekeeper. Pierre explained that the Freemium model Cléo developed based on this observation (also used by services like Zotero or Spotify) combines free (libre) and premium (pay) features. Access to HTML is free with OpenEdition, while PDF and epub formats are subscription-based and paid for by libraries. COUNTER statistics are also provided to subscribers.
Pierre highlighted the different needs of the different communities involved in the academic publication process and noted that the Freemium model gives libraries a vital role, allowing them to continue to act as gatekeepers to some features of otherwise open scholarly content. Currently 20 publishers are using OpenEdition, with 38 research libraries subscribing and 1000 books available.
Caroline Sutton spoke about “open access at the tipping point”, i.e. recent developments in the Open Access market. OASPA, the Open Access Scholarly Publishers Association, consists of a number of publishers, commercial and non-profit, e.g. BioMed Central, Co-Action Publishing, Copernicus, Hindawi, Journal of Medical Internet Research, Medical Education Online, PLoS, SAGE Publications, SPARC Europe and Utrecht University Library Publishers. The initial activism of OASPA was about dispelling fears about Open Access (Is it peer-reviewed? Is it based on serious research?). Caroline listed factors showing that the broad perception of Open Access has changed over the past few years. The new characterization is that Open Access addresses the grand challenges of our time and is an important prerequisite for economic growth. The discussion is about the finer points of how OA fits into academic publishing, rather than whether or not it should exist at all. Caroline noted that beyond gold vs. green road, there is now more talk of mixing and combining the two approaches. She pointed to a huge growth in OA publications over the last 2-3 years and noted that “everybody is getting into the game”, including commercial publishers such as Springer, SAGE and Wiley. So how necessary is an organization like OASPA if OA is so popular? As Caroline put it, “now we can roll up our sleeves and do different things” (e.g. educate legacy publishers and scholarly societies who lack the resources to successfully implement OA). Another area of OASPA’s activity is discussing what should count as an open access journal. Free access AND re-use are crucial according to Caroline, who noted that OASPA promotes the use of CC-BY across the board, although there are exceptions. It is now about making the point that re-use is interesting, about finding arguments that convince scholars and publishers of the advantages of data mining and aggregation services, for which re-use is required. Licensing and technical standards are key in this respect.
Caroline closed by noting the significance of the DOAJ and the development of new payment systems for OA article charges, which would make it easier for authors and publishers to utilize OA.
I read about this new book series titled Scholarly Communication: Past, present and future of knowledge inscription this morning on the Humanist mailing list. Since scholarly communication is one of my main research interests, I’m thrilled to hear that there will be a series devoted to publications on the topic, edited and reviewed by a long list of renowned scholars in the field.
On the other hand it’s debatable (see reactions by Michael Netwich and Toma Tasovac) whether a book series on the future of scholarly communication is not a tad anachronistic, assuming it is published exclusively in print (as seems to be the case from the announcement on the website). New approaches, such as the crowdsourcing angles of Hacking the Academy or Digital Humanities Now, seem more in sync with Internet-age publishing to me, but sadly such efforts usually don’t involve commercial publishers**. My recent struggles with Oxford University Press over a subscription to Literary and Linguistic Computing (the only way of joining the ALLC) have added once more to my skepticism towards commercial publishers. And not because their goal is to make money — there’s nothing inherently wrong with that — but because they largely refuse to innovate when it comes to their products and business models. Mailing a paper journal to someone who has no use for it is a waste of resources and a sign that you are out of touch with your customers’ needs… at least if your customer is this guy.
Do scholars in the Humanities and Social Sciences* still need printed publications and (consequently) publishers?
Do we need publishers if we decide to go all-out digital?
Do we need Open Access?
I have different stances on these questions depending on the hat I’m wearing. Individually I think print publishing is stone dead, but I also notice that by and large my colleagues still rely on printed books and journals much more heavily than on digital sources. Regarding the role of publishers and Open Access the situation is equally complex: we need publishers if our culture of communication doesn’t change, because reproducing digitally what we used to create in print is challenging (see this post for some deliberations). If we decide that blog posts can replace journal articles, because speed and efficiency ultimately win over perfectionism and we are no longer producing static objects but a constantly evolving discourse — in that case the future of commercial publishers looks uncertain. Digital toll-access publishing seems to have little traction in our field so far, something that may change with the proliferation of ebooks in the next few years.
Anyhow — what’s your take?
Should we get rid of paper?
Should we get rid of traditional formats and post everything in blogs instead?
Is Cameron Neylon right when he says that the future of research communication is aggregation?
Let me know what you think — perhaps the debate can be a first contribution to Scholarly Communication: Past, present and future.
(*) I believe the situation is fundamentally different in STM, where paper is a thing of the past but publishers are certainly not.
(**) An exception of sorts could be Liquid Pub, but that project seems focused on STM rather than Hum./Soc.Sci.
Note: this introduction, co-authored with Dieter Stein, is part of the volume Selected Papers from the Berlin 6 Open Access Conference, which will appear via Düsseldorf University Press as an electronic open access publication in the coming weeks. It is also a response to this blog post by Dan Cohen.
Timely or Timeless? The Scholar’s Dilemma. Thoughts on Open Access and the Social Contract of Publishing
Some things don’t change.
We live in a world seemingly over-saturated with information, yet getting it out there in both an appropriate form and a timely fashion is still challenging. Publishing, although the meaning of the word is undergoing significant change in the time of iPads and Kindles, is still a very complex business. In spite of a much faster, cheaper and simpler distribution process, producing scholarly information that is worth publishing is still hard work and so time-consuming that the pace of traditional academic communication sometimes seems painfully slow in comparison to the blogosphere, Wikipedia and the ever-growing buzz of social networking sites and microblogging services. How idiosyncratic does it seem in the age of cloud computing and the real-time web that this electronic volume is published one and a half years after the event its title points to? Timely is something else, you might say.
Dan Cohen, director of the Center for History and New Media at George Mason University, discusses the question of why academics are so obsessed with formal details and consequently so slow to communicate in a blog post titled “The Social Contract of Scholarly Publishing“. In it, Dan retells the experience of working on a book together with colleague Roy Rosenzweig:
“So, what now?” I said to Roy naively. “Couldn’t we just publish what we have on the web with the click of a button? What value does the gap between this stack and the finished product have? Isn’t it 95% done? What’s the last five percent for?”
We stared at the stack some more.
Roy finally broke the silence, explaining the magic of the last stage of scholarly production between the final draft and the published book: “What happens now is the creation of the social contract between the authors and the readers. We agree to spend considerable time ridding the manuscript of minor errors, and the press spends additional time on other corrections and layout, and readers respond to these signals — a lack of typos, nicely formatted footnotes, a bibliography, specialized fonts, and a high-quality physical presentation — by agreeing to give the book a serious read.”
A social contract between author and reader. Nothing more, nothing less.
It may seem either endearing or quaint how Roy Rosenzweig elevates the product of scholarship from a mere piece of more or less monetizable content to something of cultural significance, but he also aptly describes what many academics, especially in the humanities, think of as the essence of their work: creating something timeless. That is, in short, why the humanities are still in love with books, and why they retain a pace of publishing that seems entirely snail-like, both to other academic fields and to the rest of the world. Of course humanities scholars know as well as anyone that nothing is truly timeless, and understand that trends and movements shape scholarship just like they shape fashion and music. But there is still a commitment to spending the time to deliver something to the reader that is as polished and perfected as one can manage. Something that is not rushed, but refined. Why? Because the reader expects authority from a scholarly work, and authority is derived from getting it right to the best of one’s ability.
This is not just a long-winded apology to the readers and contributors to this volume, although an apology for the considerable delay is surely in order, especially taking into account the considerable commitment and patience of our authors (thank you!). Our point is something equally important, something that connects to Roy Rosenzweig’s interpretation of scholarly publishing as a social contract. This publication contains eight papers produced to expand some of the talks held at the Berlin 6 Open Access Conference that took place in November 2008 in Düsseldorf, Germany. While Open Access has successfully moved forward in the past eighteen months and much has been achieved, none of the needs, views and fundamental aspects addressed in this volume — policy frameworks to enable it (Forster, Furlong), economic and organizational structures to make it viable and sustainable (Houghton; Gentil-Beccot, Mele, and Vigen), concrete platforms in different regions (Packer et al) and disciplines (Fritze, Dallmeier-Tiessen and Pfeiffenberger) to serve as models, and finally technical standards to support it (Zier) — none of these things have lost any of their relevance.
Open Access is a timely issue and therefore the discussion about it must be timely as well, but “discussion” in a highly interactive sense is hardly ever what a published volume provides anyway – that is something the blogosphere is already better at. That doesn’t mean that what scholars produce, be it in physics, computer science, law or history, should be hallowed tomes that appear years after the controversies around the issues they cover have all but died down, to exist purely as historical documents. If that happens, scholarship itself has become a museum piece that is obsolete, because a total lack of urgency will rightly suggest to people outside of universities that a field lacks relevance. If we don’t care when it’s published, how important can it be?
But can’t our publications be both timely and timeless at once? In other words, can we preserve the values cited by Roy Rosenzweig, not out of some antiquated fetish for scholarly works as perfect documents, but simply because thoroughly discussed, well-edited and proofed papers and books (and, for that matter, blog posts) are nicer to read and easier to understand than hastily produced ones? Readers don’t like it when their time is wasted; this is as true as ever in the age of information overload. Scientists are expected to get it right, to provide reliable insight and analysis. Better to be slow than to be wrong. In an attention economy, perfectionism pays a dividend of trust.
How does this relate to Open Access? If we look beyond the laws and policy initiatives and platforms for a moment, it seems exceedingly clear that access is ultimately a solvable issue and that we are fast approaching the point where it will be solved. This shift is unlikely to happen next month or next year, but if it hasn’t taken place a decade from now our potential to do innovative research will be seriously impaired and virtually all stakeholders know this. There is growing political pressure and commercial publishers are increasingly experimenting with products that generate revenue without limiting access. Historically, universities, libraries and publishers came into existence to solve the problem of access to knowledge (intellectual and physical access). This problem is arguably in the process of disappearing, and therefore it is of pivotal importance that all those involved in spreading knowledge work together to develop innovative approaches to digital scholarship, instead of clinging to eroding business models. As hard as it is for us to imagine, society may just find that both intellectual and physical access to knowledge are possible without us and that we’re a solution in search of a problem. The remaining barriers to access will gradually be washed away because of the pressure exerted not by lawmakers, librarians and (some) scholars who care about Open Access, but mainly by a general public that increasingly demands access to the research it finances. Openness is not just a technicality. It is a powerful meme that permeates all of contemporary society.
The ability for information to be openly available creates a pressure for it to be. Timeliness and timelessness are two sides of the same coin. In the competitive future of scholarly communication, those who get everything (mostly) right will succeed. Speedy and open publication of relevant, high quality content that is well adjusted to the medium and not just the reproduction of a paper artifact will trump those publications that do not meet all the requirements. The form and pace possible will be undercut by what is considered normal in individual academic disciplines and the conventions of one field will differ from those of another. Publishing less or at a slower pace is unlikely to be perceived as a fault in the long term, with all of us having long gone past the point of informational over-saturation. The ability to effectively make oneself heard (or read), paired with having something meaningful to say, will (hopefully) be of increasing importance, rather than just a high volume of output.
Much of the remaining resistance to Open Access is simply due to ignorance, and to murky premonitions of a new dark age caused by a loss of print culture. Ultimately, there will be a redefinition of the relativities between digital and print publication. There will be a place for both: the advent of mass literacy did not lead to the disappearance of the spoken word, so the advent of the digital age is unlikely to lead to the disappearance of print culture. Transitory compromises such as delayed Open Access publishing are paving the way to fully-digital scholarship. Different approaches will be developed, and those who adapt quickly to a new pace and new tools will benefit, while those who do not will ultimately fall behind.
The ideological dimension of Open Access – whether knowledge should be free – seems strangely out of step with these developments. It is not unreasonable to assume that in the future, if it’s not accessible, it won’t be considered relevant. The logic of informational scarcity has ceased to make sense and we are still catching up with this fundamental shift.
Openness alone will not be enough. The traditional virtues of a publication – the extra 5% – are likely to remain unchanged in their importance as long as there is such a thing as institutional scholarship. We thank the authors of this volume for investing the extra 5% to enter into a social contract with their readers, and another, considerably higher percentage for their immense patience with us. The result may not be entirely timely and, as has been outlined, nothing is ever truly timeless, but we strongly believe that its relevance is undiminished by the time that has passed.
Open Access, whether 2008 or 2010, remains a challenge – not just to lawmakers, librarians and technologists, but to us, to scholars. Some may rise to the challenge while others remain defiant, but ignorance seems exceedingly difficult to maintain. Now is a bad time to bury one’s head in the sand.
Cornelius Puschmann and Dieter Stein
Edit: this post on (legal aspects of) data sharing by Creative Commons’ Kaitlin Thaney is also highly recommended.
If you’re involved in academic publishing — whether as a researcher, librarian or publisher — data sharing and data publishing are probably hot issues to you. Beyond its versatility as a platform for the dissemination of articles and ebooks, the Internet is increasingly also a place where research data lives. Scholars are no longer restricted to referring to data in their publications or including charts and graphs alongside the text, but can link directly to data published and stored elsewhere, or even embed data into their papers, a process facilitated by standards such as the Resource Description Framework (RDF).
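To make the idea of machine-readable links between papers and data a little more concrete, here is a minimal sketch in Python (standard library only) that extracts such a link from an RDF/XML fragment. The fragment and all URIs are invented for illustration; real publications would use established vocabularies and persistent identifiers.

```python
import xml.etree.ElementTree as ET

# A hypothetical RDF/XML fragment: an article that points to the
# dataset it is based on. All URIs are invented for illustration.
rdf_xml = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                      xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://example.org/articles/semantic-prosody">
    <dcterms:title>Semantic prosody across languages</dcterms:title>
    <dcterms:references rdf:resource="http://example.org/datasets/corpus-v1"/>
  </rdf:Description>
</rdf:RDF>"""

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
DCTERMS = "{http://purl.org/dc/terms/}"

# Walk the graph and list which article references which dataset.
root = ET.fromstring(rdf_xml)
for desc in root.findall(RDF + "Description"):
    article = desc.get(RDF + "about")
    for ref in desc.findall(DCTERMS + "references"):
        dataset = ref.get(RDF + "resource")
        print(article, "->", dataset)
```

The point of the exercise: because the article-to-dataset link is expressed in a standard format rather than buried in prose, any crawler or aggregator can discover and follow it automatically.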
Journals such as Earth System Science Data and the International Journal of Robotics Research give us a glimpse of how this approach might evolve in the future — from journals to data journals, publications that are concerned with presenting valuable data for reuse and that pave the way for a research process that is increasingly collaborative. Technology is gradually catching up with the need for genuinely digital publications, a need fueled by the advantages of being able to combine text, images, links, videos and a wide variety of datasets to produce a next-generation multi-modal scholarly article. Systems such as Fedora and PubMan are meant to facilitate digital publishing and assure best-practice data provenance and storage. They are able to handle different types of data and associate any number of individual files with a “data paper” that documents them.
However, technology is the much smaller issue when weighing the advantages of data publishing against its challenges — of which there are many, both for practitioners and for those supporting them. Best practices on the individual level are cultural norms that need to be established over time. Scientists still don’t have sufficient incentives to openly share their data, as tenure processes are tied to publishing results based on data, but not to sharing data directly. And finally, technology is prone to failure when there are no agreed-upon standards guiding its use, and such standards need to be gradually (meaning painfully slowly, compared with technology’s breakneck pace) established and accepted by scholars, not decreed by committee.
In March, Jonathan Rees of NeuroCommons (a project within Creative Commons/Science Commons) published a working paper that outlines such standards for reusable scholarly data. One thing I really appreciate about Rees’ approach is that it is remarkably discipline-independent and not limited to the sciences (vs. social science and the humanities).
Rees outlines how data papers differ from traditional papers:
A data paper is a publication whose primary purpose is to expose and describe data, as opposed to analyze and draw conclusions from it. The data paper enables a division of labor in which those possessing the resources and skills can perform the experiments and observations needed to collect potentially interesting data sets, so that many parties, each with a unique background and ability to analyze the data, may make use of it as they see fit.
The key phrase here (which is why I couldn’t resist boldfacing it) is division of labor. Right now, to use an auto manufacturing analogy, a scholar does not just design a beautiful car (an analysis in the form of a research paper that culminates in observations or theoretical insights), she also has to build an engine (the data that her observations are based on). It doesn’t matter if she is a much better engineer than designer: the car will only run (she’ll only get tenure) if both the engine and the car meet the same requirements. The car analogy isn’t terribly fitting, but it serves to make the point that our current system lacks a division of labor, making it pretty inefficient. It’s based more on the idea of producing smart people than on the idea of getting smart people to produce reusable research.
Rees notes that data publishing is a complicated process and lists a set of rules for successful sharing of scientific data.
From the paper:
- The author must be professionally motivated to publish the data
- The effort and economic burden of publication must be acceptable
- The data must become accessible to potential users
- The data must remain accessible over time
- The data must be discoverable by potential users
- The user’s use of the data must be permitted
- The user must be able to understand what was measured and how (materials and methods)
- The user must be able to understand all computations that were applied and their inputs
- The user must be able to apply standard tools to all file formats
At a glance, these rules signify very different things. #1 and #2 are preconditions rather than prescriptions, while #3–#6 are concerned with what the author needs to do in order to make the data available. Finally, rules #7–#9 are concerned with making the data as useful to others as possible. Rules #7–#9 depend on who “the user” is and qualify as “do-this-as-best-as-you-can”-style suggestions rather than strict requirements, not because they aren’t important, but because it’s impossible for the author to guarantee their successful implementation. By contrast, #3–#6 are concerned with providing and preserving access and are requirements: I can’t guarantee that you’ll understand (or agree with) my electronic dictionary of Halh Mongolian, but I can make sure it’s stored in an institutional or disciplinary repository that is indexed in search engines, mirrored so that the data can’t be lost, and licensed in a legally unambiguous way, rather than uploading it to my personal website and hoping for the best when it comes to long-term availability, ease of discovery and legal re-use.
Finally, Rees gives some good advice beyond tech issues to publishers who want to implement data publishing:
Set a standard. There won’t be investment in data set reusability unless granting agencies and tenure review boards see it as a legitimate activity. A journal that shows itself credible in the role of enabling reuse will be rewarded with submissions and citations, and will in turn reward authors by helping them obtain recognition for their service to the research community.
This is critical. Don’t wait for universities, grant agencies or even scholars to agree on standards entirely on their own — they can’t and won’t if they don’t know how digital publishing works (legal aspects included). Start an innovative journal and set a standard yourself by being successful.
Encourage use of standard file formats, schemas, and ontologies. It is impossible to know what file formats will be around in ten years, much less a hundred, and this problem worries digital archivists. Open standards such as XML, RDF/XML, and PNG should be encouraged. Plain text is generally transparent but risky due to character encoding ambiguity. File formats that are obviously new or exotic, that lack readily available documentation, or that do not have non-proprietary parsers should not be accepted. Ontologies and schemas should enjoy community acceptance.
An important suggestion that is entirely compatible with linguistic data (dictionaries, word lists, corpora, transcripts, etc.) and simplified by the fact that we have comparatively small datasets. Even a megaword corpus is small compared to climate data or gene banks.
Aggressively implement a clean separation of concerns. To encourage submissions and reduce the burden on authors and publishers, avoid the imposition of criteria not related to data reuse. These include importance (this will not be known until after others work with the data) and statistical strength (new methods and/or meta-analysis may provide it). The primary peer review criterion should be adequacy of experimental and computational methods description in the service of reuse.
This will be a tough nut to crack, because it sheds tradition to a degree. Relevance was always high on the list of requirements while publications were scarce — paper costs money, therefore what was published had to be important to as many people as possible. With data publishing this is no longer true — whether something is important or statistically strong (applying this to linguistics, one might say representative, well-documented, etc.) is impossible to know from the outset. It’s much more sensible to get it out there and deal with the analysis later, rather than creating an artificial scarcity of data. But it will take time and cultural change to get researchers (and both funding agencies and hiring committees) to adapt to this approach.
In the meantime, while we’re still publishing traditional (non-data) papers, we can at least work on making them more accessible. Something like arXiv for linguistics wouldn’t hurt.
At its recent annual meeting in Baltimore, the Linguistic Society of America (LSA) passed a resolution on data sharing that is the result of a series of discussions that took place last year, for example at the meeting of the Cyberlinguistics group in Berkeley last June.
Here’s the text (snip):
Whereas modern computing technology has the potential of advancing linguistic science by enabling linguists to work with datasets at a scale previously unimaginable; and
Whereas this will only be possible if such data are made available and standards ensuring interoperability are followed; and
Whereas data collected, curated, and annotated by linguists forms the empirical base of our field; and
Whereas working with linguistic data requires computational tools supporting analysis and collaboration in the field, including standards, analysis tools, and portals that bring together linguistic data and tools to analyze them,
Therefore, be it resolved at the annual business meeting on 8 January 2010 that the Linguistic Society of America encourages members and other working linguists to:
make the full data sets behind publications available, subject to all relevant ethical and legal concerns; annotate data and provide metadata according to current standards and best practices; seek wherever possible institutional review board human subjects approval that allows full recordings and transcripts to be made available for other research; contribute to the development of computational tools which support the analysis of linguistic data; work towards assigning academic credit for the creation and maintenance of linguistic databases and computational tools; and when serving as reviewers, expect full data sets to be published (again subject to legal and ethical considerations) and expect claims to be tested against relevant publicly available datasets.
I think it’s great that the LSA is throwing its weight behind this effort and supporting the idea of data sharing. The only minor complaint that I have concerns the wording – what exactly does make available mean? It could mean real Open Access, but also that you’ll email me your datasets if I ask nicely. Or it could mean that a publisher will make your datasets available for a fee – any of these approaches qualify as making data available in this terminology.
So, while I think this is good starting point, more discussion is needed. Especially when it comes to formats, means of access and licensing we need to be more explicit.
Imagine this scenario for a moment: you want to compare the semantic prosody of the verb cause across a dozen languages. If data sharing (and beyond that, resource sharing) were already a reality, we could do something like this:
1. Send a query to WordNetAPI* to identify the closest synonyms of cause in the target languages.
2. Send a query to UniversalCorpusAPI* using the terms we have just identified, specifying a list of megacorpora that we want to search in.
3. Retrieve the result in TEI-XML.
4. Analyze the results in R using the XML package.
The decisive advantage here would be that I only get the data I need, not everything else that’s in those megacorpora that is unrelated to my query. Things just need to be in XML and openly available and I can continue to process them in other ways. This would not just be sharing, but embedding your data in an infrastructure that makes it usable as part of a service. And that would be neat because what good is the data really if it doesn’t come with the tools needed to analyze it? And in 2010 tools=services, not locally installed software.
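As a sketch of what such a service-based pipeline could look like, here is a minimal Python mock-up of the four steps. Both services from the list above are fictional, so they are stubbed here with canned responses (the German synonyms and the TEI fragment are invented), and a trivial lemma count stands in for the R analysis of step 4.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def query_wordnet(lemma, lang):
    """Stub for the fictional WordNet service: return close synonyms of a lemma."""
    canned = {("cause", "de"): ["verursachen", "bewirken"]}
    return canned.get((lemma, lang), [])

def query_corpus(terms, corpora):
    """Stub for the fictional corpus service: return matches as TEI-XML."""
    return """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
        <s><w lemma="verursachen">verursacht</w> <w lemma="Schaden">Schaden</w></s>
        <s><w lemma="bewirken">bewirkt</w> <w lemma="Wandel">Wandel</w></s>
    </body></text></TEI>"""

# Steps 1-3: identify synonyms, query the corpora, retrieve the hits as TEI-XML.
terms = query_wordnet("cause", "de")
tei = ET.fromstring(query_corpus(terms, ["megacorpus-de"]))

# Step 4: a stand-in analysis -- count how often each target lemma occurs.
counts = Counter(w.get("lemma") for w in tei.iter(TEI_NS + "w")
                 if w.get("lemma") in terms)
```

Because each step consumes and produces open, standard formats (plain terms in, TEI-XML out), any of the stubs could be swapped for a real web service without changing the analysis code at the end.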
Now that would be awesome.
(*) fictional at this point, but technically quite feasible.