<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cornelius Puschmann&#039;s Blog</title>
	<atom:link href="http://blog.ynada.com/feed" rel="self" type="application/rss+xml" />
	<link>http://blog.ynada.com</link>
	<description>My new blog on Linguistics, Digital Humanities and Scholarly Communication on the Internet</description>
	<lastBuildDate>Thu, 05 Aug 2010 00:17:23 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Plotting texts as graphs with R and igraph</title>
		<link>http://blog.ynada.com/303</link>
		<comments>http://blog.ynada.com/303#comments</comments>
		<pubDate>Wed, 04 Aug 2010 23:29:20 +0000</pubDate>
		<dc:creator>cornelius</dc:creator>
				<category><![CDATA[data]]></category>
		<category><![CDATA[igraph]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[text visualization]]></category>

		<guid isPermaLink="false">http://blog.ynada.com/?p=303</guid>
		<description><![CDATA[I&#8217;ve plotted several word association graphs for this New York Times article (1st paragraph) using R and the igraph library. #1, random method #2, circle method #3, sphere method #4, spring method #5, fruchterman-reingold method # 6, kamada-kawai method #7, graphopt method The red vertices mark cliques. Here&#8217;s the (rough) R code for plotting such [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve plotted several word association graphs for <a href="http://www.nytimes.com/2010/08/01/magazine/01wwln-lede-t.html">this New York Times article</a> (1st paragraph) using <a href="http://www.r-project.org/">R</a> and the <a href="http://igraph.sourceforge.net/">igraph</a> library.</p>
<p>#1, random method</p>
<p><a href="http://www.flickr.com/photos/52016080@N00/4861635742/" title="text-igraph-random by cornelius_puschmann, on Flickr"><img src="http://farm5.static.flickr.com/4093/4861635742_5696e727b3.jpg" width="500" height="500" alt="text-igraph-random" /></a></p>
<p>#2, circle method</p>
<p><a href="http://www.flickr.com/photos/52016080@N00/4861635946/" title="text-igraph-circle by cornelius_puschmann, on Flickr"><img src="http://farm5.static.flickr.com/4117/4861635946_66bf2abd8f.jpg" width="500" height="500" alt="text-igraph-circle" /></a></p>
<p>#3, sphere method</p>
<p><a href="http://www.flickr.com/photos/52016080@N00/4861636150/" title="text-igraph-sphere by cornelius_puschmann, on Flickr"><img src="http://farm5.static.flickr.com/4075/4861636150_d8efba7462.jpg" width="500" height="500" alt="text-igraph-sphere" /></a></p>
<p>#4, spring method</p>
<p><a href="http://www.flickr.com/photos/52016080@N00/4861015597/" title="text-igraph-spring by cornelius_puschmann, on Flickr"><img src="http://farm5.static.flickr.com/4074/4861015597_9176e76456.jpg" width="500" height="500" alt="text-igraph-spring" /></a></p>
<p>#5, fruchterman-reingold method</p>
<p><a href="http://www.flickr.com/photos/52016080@N00/4861015721/" title="text-igraph-fruchterman-reingold by cornelius_puschmann, on Flickr"><img src="http://farm5.static.flickr.com/4134/4861015721_b609b5a5fd.jpg" width="500" height="500" alt="text-igraph-fruchterman-reingold" /></a></p>
<p>#6, kamada-kawai method</p>
<p><a href="http://www.flickr.com/photos/52016080@N00/4861636642/" title="text-igraph-kamada-kawai by cornelius_puschmann, on Flickr"><img src="http://farm5.static.flickr.com/4078/4861636642_c9bd7e8cde.jpg" width="500" height="500" alt="text-igraph-kamada-kawai" /></a></p>
<p>#7, graphopt method</p>
<p><a href="http://www.flickr.com/photos/52016080@N00/4861636756/" title="text-igraph-graphopt by cornelius_puschmann, on Flickr"><img src="http://farm5.static.flickr.com/4134/4861636756_173f21413e.jpg" width="500" height="500" alt="text-igraph-graphopt" /></a></p>
<p>The red vertices mark <a href="http://en.wikipedia.org/wiki/Clique_(graph_theory)">cliques</a>. Here&#8217;s the (rough) R code for plotting such graphs:</p>
<p><code>rm(list=ls());</p>
<p>library("igraph");<br />
library("Cairo");</p>
<p># read parameters<br />
print("Text-as-Graph for R 0.1");<br />
print("------------------------------------");</p>
<p>print("Path (no trailing slash): ");<br />
datafolder <- scan(file="", what="char");</p>
<p>print("Text file: ");<br />
datafile <- scan(file="", what="char");</p>
<p>txt <- scan(paste(datafolder, datafile, sep="/"), what="char", sep="\n", encoding="UTF-8");</p>
<p>print("Width/Height (e.g. 1024x768): ");<br />
res <- scan(file="", what="char");<br />
rwidth <- unlist(strsplit(res, "x"))[1]<br />
rheight <- unlist(strsplit(res, "x"))[2]</p>
<p>words <- unlist(strsplit(gsub("[[:punct:]]", " ", tolower(txt)), "[[:space:]]+"));</p>
<p>g.start <- 1;</p>
<p>g.end <- length(words) - 1;</p>
<p>assocs <- matrix(nrow=g.end, ncol=2)</p>
<p>for (i in g.start:g.end)<br />
{<br />
	assocs[i,1] <- words[i];<br />
	assocs[i,2] <- words[i+1];<br />
	print(paste("Pass #", i, " of ", g.end, ". ", "Node word is ", toupper(words[i]), ".", sep=""));<br />
}</p>
<p>print("Build graph from data frame...");<br />
g.assocs <- graph.data.frame(assocs, directed=F);</p>
<p>print("Label vertices...");<br />
V(g.assocs)$label <- V(g.assocs)$name;</p>
<p>print("Associate colors...");<br />
V(g.assocs)$color <- "Gray";</p>
<p>print("Find cliques...");<br />
V(g.assocs)[unlist(largest.cliques(g.assocs))]$color <- "Red";</p>
<p>print("Plotting random graph...");<br />
CairoPNG(paste(datafolder, "/", "text-igraph-random.png", sep=""), width=as.numeric(rwidth), height=as.numeric(rheight));<br />
plot(g.assocs, layout=layout.random, vertex.size=4, vertex.label.dist=0);<br />
dev.off();</p>
<p>print("Plotting circle graph...");<br />
CairoPNG(paste(datafolder, "/", "text-igraph-circle.png", sep=""), width=as.numeric(rwidth), height=as.numeric(rheight));<br />
plot(g.assocs, layout=layout.circle, vertex.size=4, vertex.label.dist=0);<br />
dev.off();</p>
<p>print("Plotting sphere graph...");<br />
CairoPNG(paste(datafolder, "/", "text-igraph-sphere.png", sep=""), width=as.numeric(rwidth), height=as.numeric(rheight));<br />
plot(g.assocs, layout=layout.sphere, vertex.size=4, vertex.label.dist=0);<br />
dev.off();</p>
<p>print("Plotting spring graph...");<br />
CairoPNG(paste(datafolder, "/", "text-igraph-spring.png", sep=""), width=as.numeric(rwidth), height=as.numeric(rheight));<br />
plot(g.assocs, layout=layout.spring, vertex.size=4, vertex.label.dist=0);<br />
dev.off();</p>
<p>print("Plotting fruchterman-reingold graph...");<br />
CairoPNG(paste(datafolder, "/", "text-igraph-fruchterman-reingold.png", sep=""), width=as.numeric(rwidth), height=as.numeric(rheight));<br />
plot(g.assocs, layout=layout.fruchterman.reingold, vertex.size=4, vertex.label.dist=0);<br />
dev.off();</p>
<p>print("Plotting kamada-kawai graph...");<br />
CairoPNG(paste(datafolder, "/", "text-igraph-kamada-kawai.png", sep=""), width=as.numeric(rwidth), height=as.numeric(rheight));<br />
plot(g.assocs, layout=layout.kamada.kawai, vertex.size=4, vertex.label.dist=0);<br />
dev.off();</p>
<p>#CairoPNG(paste(datafolder, "/", "text-igraph-reingold-tilford.png", sep=""), width=as.numeric(rwidth), height=as.numeric(rheight));<br />
#plot(g.assocs, layout=layout.reingold.tilford, vertex.size=4, vertex.label.dist=0);<br />
#dev.off();</p>
<p>print("Plotting graphopt graph...");<br />
CairoPNG(paste(datafolder, "/", "text-igraph-graphopt.png", sep=""), width=as.numeric(rwidth), height=as.numeric(rheight));<br />
plot(g.assocs, layout=layout.graphopt, vertex.size=4, vertex.label.dist=0);<br />
dev.off();</p>
<p>print("Done!");</code></p>
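<p>The word-pair loop in the script above can also be written without an explicit loop. What follows is only a minimal sketch in base R, not the code used for the plots above; <code>word_pairs</code> is a hypothetical helper name, and dropping empty tokens is an added assumption.</p>

```r
# Hypothetical helper: turn raw text lines into a two-column
# character matrix of adjacent word pairs (word, next word).
word_pairs <- function(txt) {
  # Lowercase, replace punctuation with spaces, split on whitespace
  words <- unlist(strsplit(gsub("[[:punct:]]", " ", tolower(txt)), "[[:space:]]+"))
  # Assumption: empty tokens (e.g. from leading spaces) are dropped
  words <- words[nzchar(words)]
  if (length(words) < 2) {
    return(matrix(character(0), ncol = 2))
  }
  # Pair every word with its successor
  cbind(head(words, -1), tail(words, -1))
}

edges <- word_pairs(c("The cat sat.", "The cat ran!"))
print(edges)
```

<p>The resulting matrix could be converted with <code>as.data.frame()</code> and passed to <code>graph.data.frame()</code> in the same way as <code>assocs</code> in the script above.</p>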
]]></content:encoded>
			<wfw:commentRss>http://blog.ynada.com/303/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Do we need paper publications in the Humanities and Social Sciences? Do we need commercial publishers?</title>
		<link>http://blog.ynada.com/291</link>
		<comments>http://blog.ynada.com/291#comments</comments>
		<pubDate>Tue, 03 Aug 2010 14:52:12 +0000</pubDate>
		<dc:creator>cornelius</dc:creator>
				<category><![CDATA[Thoughts]]></category>
		<category><![CDATA[digital humanities]]></category>
		<category><![CDATA[digital scholarly communication]]></category>
		<category><![CDATA[open access]]></category>
		<category><![CDATA[publishing]]></category>

		<guid isPermaLink="false">http://blog.ynada.com/?p=291</guid>
		<description><![CDATA[I read about this new book series titled Scholarly Communication: Past, present and future of knowledge inscription this morning on the Humanist mailing list. Since scholarly communication is one of my main research interests, I&#8217;m thrilled to hear that there will be a series devoted to publications focusing on the topic, edited and reviewed by a [...]]]></description>
			<content:encoded><![CDATA[<p>I read about <a href="http://brill.nl/sc">this new book series</a> titled <em>Scholarly Communication: Past, present and future of knowledge inscription</em> this morning on the <a href="http://digitalhumanities.org/humanist/">Humanist</a> mailing list. Since scholarly communication is one of my main research interests, I&#8217;m thrilled to hear that there will be a series devoted to publications focusing on the topic, edited and reviewed by a long list of renowned scholars in the field.</p>
<p>On the other hand it&#8217;s debatable (see reactions by <a href="http://twitter.com/cyberscientist/status/20220279721">Michael Netwich</a> and <a href="http://twitter.com/ttasovac/status/20221265039">Toma Tasovac</a>) whether a <em>book series</em> on the future of scholarly communication is not a tad anachronistic, assuming it is published exclusively in print (seems to be the case from the look of the announcement on the website). New approaches, such as the crowdsourcing angles of <a href="http://hackingtheacademy.org/">Hacking the Academy</a> or <a href="http://digitalhumanitiesnow.org/">Digital Humanities Now</a>, seem more in sync with Internet-age publishing to me, but sadly such efforts usually don&#8217;t involve commercial publishers**. My <a href="http://twitter.com/coffee001/status/18946683196">recent struggles</a> with Oxford University Press over a subscription to <a href="http://llc.oxfordjournals.org/">Literary and Linguistic Computing</a> (the only way of joining the <a href="http://www.allc.org/">ALLC</a>) have added once more to my skepticism towards commercial publishers. And <strong>not </strong>because their goal is to make money &#8212; there&#8217;s nothing wrong with that inherently &#8212; but because they largely refuse to innovate when it comes to their products and business models. Mailing a paper journal to someone who has no use for it is a waste of resources and a sign that you are out of touch with your customers&#8217; needs&#8230; at least if your customer is this guy.</p>
<p>Do scholars in the Humanities and Social Sciences* still need printed publications and (consequently) publishers?</p>
<p>Do we need publishers if we decide to go all-out digital?</p>
<p>Do we need Open Access?</p>
<p>I have different stances in relation to these questions depending on the hat I&#8217;m wearing. Individually I think print publishing is stone dead, but I also notice that by and large my colleagues still rely on printed books and journals much more heavily than digital sources. Regarding the role of publishers and Open Access the situation is equally complex: we need publishers if our culture of communication doesn&#8217;t change, because reproducing digitally what we used to create in print is challenging (see <a href="http://blog.ynada.com/242">this post</a> for some deliberations). If we decide that blog posts can replace journal articles because speed and efficiency ultimately win over perfectionism, since we are no longer producing static objects but a constantly evolving discourse &#8212; in that case the future of commercial publishers looks uncertain. Digital toll-access publishing seems to have little traction in our field so far, something that is likely to change with the proliferation of ebooks we are likely to see in the next few years.</p>
<p>Anyhow &#8212; what&#8217;s your take?</p>
<p>Should we get rid of paper?</p>
<p>Should we get rid of traditional formats and post everything in blogs instead?</p>
<p>Is Cameron Neylon right when he says that <a href="http://cameronneylon.net/blog/the-future-of-research-communication-is-aggregation/">the future of research communication is aggregation</a>?</p>
<p>Let me know what you think &#8212; perhaps the debate can be a first contribution to <em>Scholarly Communication: Past, present and future</em>. <img src='http://blog.ynada.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>(*) I believe the situation is fundamentally different in STM, where paper is a thing of the past but publishers are certainly not.</p>
<p>(**) An exception of sorts could be <a href="http://project.liquidpub.org/">Liquid Pub</a>, but that project seems focused on STM rather than Hum./Soc.Sci.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.ynada.com/291/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Graphing Twitter friends/followers with R (updated)</title>
		<link>http://blog.ynada.com/279</link>
		<comments>http://blog.ynada.com/279#comments</comments>
		<pubDate>Thu, 24 Jun 2010 22:37:23 +0000</pubDate>
		<dc:creator>cornelius</dc:creator>
				<category><![CDATA[Code]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://blog.ynada.com/?p=279</guid>
		<description><![CDATA[Here&#8217;s an updated version of my script from last month, something I&#8217;ve been meaning to do for a while. I thank Anatol Stefanowitsch and Gábor Csárdi for improving my quite sloppy code. # Load twitteR and igraph packages. library(twitteR) library(igraph) # Start a Twitter session. sess]]></description>
			<content:encoded><![CDATA[<p>Here&#8217;s an updated version of <a href="http://blog.ynada.com/247">my script from last month</a>, something I&#8217;ve been meaning to do for a while. I thank <a href="http://www-user.uni-bremen.de/~anatol/">Anatol Stefanowitsch</a> and <a href="http://cneuro.rmki.kfki.hu/people/csardi">Gábor Csárdi</a> for improving my quite sloppy code.</p>
<p><code><br />
# Load twitteR and igraph packages.<br />
library(twitteR)<br />
library(igraph)<br />
</code><br />
<code><br />
# Start a Twitter session.<br />
sess <- initSession('USERNAME', 'PASSWORD')<br />
</code><br />
<code><br />
# Retrieve a maximum of 20 friends/followers for yourself or someone else. Note that<br />
# at the moment, the limit parameter does not [yet] seem to be working.<br />
friends.object <- userFriends('USERNAME', n=20, sess)<br />
followers.object <- userFollowers('USERNAME', n=20, sess)<br />
</code><br />
<code><br />
# Retrieve the names of your friends and followers from the friend<br />
# and follower objects.<br />
friends <- sapply(friends.object,name)<br />
followers <- sapply(followers.object,name)<br />
</code><br />
<code><br />
# Create a data frame that relates friends and followers to you for expression in the graph<br />
relations <- merge(data.frame(User='YOUR_NAME', Follower=friends), data.frame(User=followers, Follower='YOUR_NAME'), all=T)<br />
</code><br />
<code><br />
# Create graph from relations.<br />
g <- graph.data.frame(relations, directed = T)<br />
</code><br />
<code><br />
# Assign labels to the graph (=people's names)<br />
V(g)$label <- V(g)$name<br />
</code><br />
<code><br />
# Plot the graph using plot() or tkplot().<br />
tkplot(g)<br />
</code></p>
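<p>To see what the <code>merge()</code> step produces without a Twitter session, here is a small sketch in base R. The names <code>alice</code>, <code>bob</code> and <code>carol</code> are made-up placeholders, and <code>following</code> and <code>followed_by</code> are hypothetical variable names that do not appear in the script above.</p>

```r
# Made-up example data standing in for the names retrieved from Twitter
me        <- "YOUR_NAME"
friends   <- c("alice", "bob")    # accounts you follow
followers <- c("bob", "carol")    # accounts that follow you

# One row per edge; the first column is the source, the second the target
following   <- data.frame(User = me, Follower = friends, stringsAsFactors = FALSE)
followed_by <- data.frame(User = followers, Follower = me, stringsAsFactors = FALSE)

# Both frames share the same columns, so merge(..., all = TRUE)
# keeps every row from either frame
relations <- merge(following, followed_by, all = TRUE)
print(relations)
```

<p>Someone who is both a friend and a follower (<code>bob</code> here) appears in two rows, one per direction, which is what a directed graph needs.</p>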
]]></content:encoded>
			<wfw:commentRss>http://blog.ynada.com/279/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Post-event Twitter stats for #THATcamp</title>
		<link>http://blog.ynada.com/265</link>
		<comments>http://blog.ynada.com/265#comments</comments>
		<pubDate>Wed, 26 May 2010 16:08:17 +0000</pubDate>
		<dc:creator>cornelius</dc:creator>
				<category><![CDATA[data]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[THATcamp]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://blog.ynada.com/?p=265</guid>
		<description><![CDATA[I thought I&#8217;d post an updated version of the simple stats on Twitter activity presented here. The data in the older post was collected before THATcamp took place, the graphs below show the activity during and after the camp. The tweets I&#8217;ve collected are also available here (my own file) and on TwapperKeeper. Tweets over [...]]]></description>
			<content:encoded><![CDATA[<p>I thought I&#8217;d post an updated version of the simple stats on Twitter activity presented <a href="http://thatcamp.org/2010/twitterstat/">here</a>. The data in the older post was collected before <a href="http://thatcamp.org">THATcamp</a> took place; the graphs below show the activity during and after the camp.</p>
<p>The tweets I&#8217;ve collected are also available <a href='http://blog.ynada.com/wp-content/uploads/2010/05/thatcamp-all.zip'>here</a> (my own file) and on <a href="http://www.twapperkeeper.com/hashtag/thatcamp">TwapperKeeper</a>.</p>
<p style="clear: both" />
<p><strong>Tweets over time (roughly 14th of May to 24th)</strong></p>
<p><a href="http://blog.ynada.com/wp-content/uploads/2010/05/tc-time2.png"><img src="http://blog.ynada.com/wp-content/uploads/2010/05/tc-time2-300x187.png" alt="" title="tc-time2" width="300" height="187" class="aligncenter size-medium wp-image-276" /></a></p>
<p style="clear: both" />
<p><strong>Most active users</strong></p>
<p><a href="http://blog.ynada.com/wp-content/uploads/2010/05/tc-activity.png"><img src="http://blog.ynada.com/wp-content/uploads/2010/05/tc-activity-300x175.png" alt="" title="tc-activity" width="300" height="175" class="aligncenter size-medium wp-image-266" /></a></p>
<p style="clear: both" />
<p><strong>Most @-messaged users</strong></p>
<p><a href="http://blog.ynada.com/wp-content/uploads/2010/05/tc-ats.png"><img src="http://blog.ynada.com/wp-content/uploads/2010/05/tc-ats-300x175.png" alt="" title="tc-ats" width="300" height="175" class="aligncenter size-medium wp-image-267" /></a></p>
<p style="clear: both" />
<p><strong>Most retweeted users</strong></p>
<p><a href="http://blog.ynada.com/wp-content/uploads/2010/05/tc-rts.png"><img src="http://blog.ynada.com/wp-content/uploads/2010/05/tc-rts-300x175.png" alt="" title="tc-rts" width="300" height="175" class="aligncenter size-medium wp-image-268" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.ynada.com/265/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>URLs tweeted at #THATCamp (all 230 of them)</title>
		<link>http://blog.ynada.com/261</link>
		<comments>http://blog.ynada.com/261#comments</comments>
		<pubDate>Mon, 24 May 2010 04:29:46 +0000</pubDate>
		<dc:creator>cornelius</dc:creator>
				<category><![CDATA[data]]></category>
		<category><![CDATA[links]]></category>
		<category><![CDATA[THATcamp]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://blog.ynada.com/?p=261</guid>
		<description><![CDATA[I&#8217;ve data-mined the #thatcamp hashtag a bit more and extracted all 230 links that were tweeted recently (also includes some of THATCamp Paris). Enjoy (or go here to view the table inside Google Docs)]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve data-mined the <a href="http://search.twitter.com/search?q=%23thatcamp">#thatcamp hashtag</a> a bit more and extracted all 230 links that were tweeted recently (also includes some of <a href="http://tcp.hypotheses.org/">THATCamp Paris</a>). Enjoy <img src='http://blog.ynada.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p><iframe width='500' height='300' frameborder='0' src='http://spreadsheets.google.com/pub?key=0AtlnDQYcdMO1dGd5TEpFOXRoam5CLWZFREJiSlBfY1E&#038;single=true&#038;gid=0&#038;output=html&#038;widget=true'></iframe></p>
<p>(or go <a href="http://spreadsheets.google.com/pub?key=0AtlnDQYcdMO1dGd5TEpFOXRoam5CLWZFREJiSlBfY1E&#038;single=true&#038;gid=0&#038;output=html">here</a> to view the table inside Google Docs)</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.ynada.com/261/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Code and brief instruction for graphing Twitter with R</title>
		<link>http://blog.ynada.com/247</link>
		<comments>http://blog.ynada.com/247#comments</comments>
		<pubDate>Sun, 23 May 2010 22:54:37 +0000</pubDate>
		<dc:creator>cornelius</dc:creator>
				<category><![CDATA[Code]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[THATcamp]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://blog.ynada.com/?p=247</guid>
		<description><![CDATA[Edit: I&#8217;ve posted an updated version of the script here. It is not quite as compressed as Anatol&#8217;s version, but I think it&#8217;s a decent compromise between readability and efficiency. I hacked together some code for R last night to visualize a Twitter graph (=who you are following and who is following you) that I [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Edit:</strong> I&#8217;ve posted an updated version of the script <a href="http://blog.ynada.com/279">here</a>. It is not quite as compressed as Anatol&#8217;s version, but I think it&#8217;s a decent compromise between readability and efficiency. <img src='http://blog.ynada.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>I hacked together some code for <a href="http://www.r-project.org/">R</a> last night to visualize a Twitter graph (=who you are following and who is following you) that I briefly showed at <a href="http://thatcamp.org/2010/visualizing-text/">the session on visualizing text</a> today at <a href="http://thatcamp.org/">THATCamp</a> and that I wanted to share. My comments in the code are very basic and there is much to improve, but in the spirit of &#8220;release early, release often&#8221;, I think it&#8217;s better to get it out there right away.</p>
<p>Ingredients:</p>
<ul>
<li><a href="http://www.r-project.org/">R</a></li>
<li><a href="http://cran.r-project.org/web/packages/twitteR/index.html">twitteR package</a></li>
<li><a href="http://igraph.sourceforge.net/">igraph package</a></li>
</ul>
<p>Note that packages are most easily installed with the <code>install.packages()</code> function inside of R, so R is really the only thing you need to download initially.</p>
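<p>As a rough sketch of that install step: the snippet below checks which of the two packages are missing before calling <code>install.packages()</code>. The helper name <code>missing_packages</code> is hypothetical, and the install itself only runs in an interactive session; whether both packages are currently available from your CRAN mirror is an assumption you should verify.</p>

```r
# Packages used in the script below
needed <- c("twitteR", "igraph")

# Hypothetical helper: return the packages that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(needed)

# Only install when run interactively, so that sourcing this file
# in a batch job never triggers a download
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```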
<p><strong>Code:</strong></p>
<p><code># Load twitteR package<br />
library(twitteR)</code></p>
<p><code># Load igraph package<br />
library(igraph)</code><br />
<code><br />
# Set up friends and followers as vectors. This, along with some stuff below, is not really necessary, but the result of my relative inability to deal with the twitter user object in an elegant way. I'm hopeful that I will figure out a way of shortening this in the future</code></p>
<p><code>friends <- as.character()<br />
followers <- as.character()</code></p>
<p><code># Start a Twitter session. Note that the user through whom the session is started doesn't have to be the one that you search for in the next step. I'm using myself (coffee001) in the code below, but you could authenticate with your username and then search for somebody else.</code></p>
<p><code>sess <- initSession('coffee001', 'mypassword')</code><br />
<code><br />
# Retrieve a maximum of 500 friends for user 'coffee001'.</code></p>
<p><code>friends.object <- userFriends('coffee001', n=500, sess)</code></p>
<p><code># Retrieve a maximum of 500 followers for 'coffee001'. Note that retrieving many/all of your followers will create a very busy graph, so if you are experimenting it's better to start with a small number of people (I used 25 for the graph below).</code></p>
<p><code>followers.object <- userFollowers('coffee001', n=500, sess)</code></p>
<p><code># This code is necessary at the moment, but only because I don't know how to slice just the "name" field for friends and followers from the list of user objects that twitteR retrieves. I am 100% sure there is an alternative to looping over the objects, I just haven't found it yet. Let me know if you do...</code></p>
<p><code>for (i in seq_along(friends.object))<br />
{<br />
	friends <- c(friends, friends.object[[i]]@name);<br />
}</code><br />
<code><br />
for (i in seq_along(followers.object))<br />
{<br />
	followers <- c(followers, followers.object[[i]]@name);<br />
}</code></p>
<p><code><br />
# Create data frames that relate friends and followers to the user you search for and merge them.</code></p>
<p><code>relations.1 <- data.frame(User='Cornelius', Follower=friends)<br />
relations.2 <- data.frame(User=followers, Follower='Cornelius')<br />
relations <- merge(relations.1, relations.2, all=T)</code></p>
<p><code># Create graph from relations.</code></p>
<p><code>g <- graph.data.frame(relations, directed = T)</code></p>
<p><code># Assign labels to the graph (=people's names)</code></p>
<p><code>V(g)$label <- V(g)$name</code></p>
<p><code># Plot the graph.</code></p>
<p><code>plot(g)</code></p>
<p>For the screenshot below I've used the <code>tkplot()</code> method instead of <code>plot()</code>, which allows you to move around and highlight elements interactively with the mouse after plotting them. The graph only shows 20 people in order to keep the complexity manageable. </p>
<p><a href="http://blog.ynada.com/wp-content/uploads/2010/05/twitter.png"><img src="http://blog.ynada.com/wp-content/uploads/2010/05/twitter-300x175.png" alt="" title="twitter" width="300" height="175" class="alignleft size-medium wp-image-251" /></a></p>
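<p>The two loops that collect the names can be replaced by a single call per list. The sketch below shows the idea in base R; <code>MockUser</code> is a hypothetical stand-in class with a <code>name</code> slot, used only so the example runs without a Twitter session, and it is not the class that twitteR actually returns.</p>

```r
library(methods)

# Hypothetical stand-in for the user objects, with a "name" slot
setClass("MockUser", representation(name = "character"))

friends.object <- list(
  new("MockUser", name = "alice"),
  new("MockUser", name = "bob")
)

# Extract the name slot from every object in one step;
# vapply() also checks that each result is a single string
friends <- vapply(friends.object, function(u) u@name, character(1))
print(friends)

# An empty list simply yields an empty character vector
no_friends <- vapply(list(), function(u) u@name, character(1))
```

<p>Unlike a loop over <code>1:length(x)</code>, this also behaves sensibly when the list is empty, because <code>1:0</code> would otherwise iterate over 1 and 0.</p>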
]]></content:encoded>
			<wfw:commentRss>http://blog.ynada.com/247/feed</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Timely or Timeless? The Scholar&#8217;s Dilemma.</title>
		<link>http://blog.ynada.com/242</link>
		<comments>http://blog.ynada.com/242#comments</comments>
		<pubDate>Wed, 19 May 2010 20:50:42 +0000</pubDate>
		<dc:creator>cornelius</dc:creator>
				<category><![CDATA[Thoughts]]></category>
		<category><![CDATA[digital humanities]]></category>
		<category><![CDATA[digital scholarship]]></category>
		<category><![CDATA[open access]]></category>
		<category><![CDATA[publishing]]></category>

		<guid isPermaLink="false">http://blog.ynada.com/?p=242</guid>
		<description><![CDATA[Note: this introduction, co-authored with Dieter Stein, is part of the volume Selected Papers from the Berlin 6 Open Access Conference, which will appear via Düsseldorf University Press as an electronic open access publication in the coming weeks. It is also a response to this blog post by Dan Cohen. Timely or Timeless? The Scholar&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Note:</strong> this introduction, co-authored with <a href="http://www.phil-fak.uni-duesseldorf.de/anglistik3/stein/">Dieter Stein</a>, is part of the volume <em>Selected Papers from the Berlin 6 Open Access Conference</em>, which will appear via Düsseldorf University Press as an electronic open access publication in the coming weeks. It is also a response to <a href="http://www.dancohen.org/2010/03/05/the-social-contract-of-scholarly-publishing/">this blog post</a> by <a href="http://www.dancohen.org/">Dan Cohen</a>.</p>
<p><strong>Timely or Timeless? The Scholar&#8217;s Dilemma. Thoughts on Open Access and the Social Contract of Publishing </strong></p>
<p>Some things don&#8217;t change.</p>
<p>We live in a world seemingly over-saturated with information, yet getting it out there in both an appropriate form and a timely fashion is still challenging. Publishing, although the meaning of the word is undergoing significant change in the time of iPads and Kindles, is still a very complex business. In spite of a much faster, cheaper and simpler distribution process, producing scholarly information that is worth publishing is still hard work and so time-consuming that the pace of traditional academic communication sometimes seems painfully slow in comparison to the blogosphere, Wikipedia and the ever-growing buzz of social networking sites and microblogging services. How idiosyncratic does it seem in the age of cloud computing and the real-time web that this electronic volume is published one and a half years after the event its title points to? Timely is something else, you might say. </p>
<p>Dan Cohen, director of the Center for History and New Media at George Mason University, discusses the question of why academics are so obsessed with formal details and consequently so slow to communicate in a blog post titled &#8220;<a href="http://www.dancohen.org/2010/03/05/the-social-contract-of-scholarly-publishing/">The Social Contract of Scholarly Publishing</a>&#8221;. In it, Dan retells the experience of working on a book together with colleague Roy Rosenzweig: </p>
<blockquote>
<p>“So, what now?” I said to Roy naively. “Couldn’t we just publish what we have on the web with the click of a button? What value does the gap between this stack and the finished product have? Isn’t it 95% done? What’s the last five percent for?” </p>
<p>We stared at the stack some more. </p>
<p>Roy finally broke the silence, explaining the magic of the last stage of scholarly production between the final draft and the published book: “What happens now is the creation of the social contract between the authors and the readers. We agree to spend considerable time ridding the manuscript of minor errors, and the press spends additional time on other corrections and layout, and readers respond to these signals — a lack of typos, nicely formatted footnotes, a bibliography, specialized fonts, and a high-quality physical presentation — by agreeing to give the book a serious read.” </p>
</blockquote>
<p>A social contract between author and reader. Nothing more, nothing less. </p>
<p>It may seem either sympathetic or quaint how Roy Rosenzweig elevates the product of scholarship from a mere piece of more or less monetizable content to something of cultural significance, but he also aptly describes what many academics, especially in the humanities, think of as the essence of their work: creating something timeless. That is, in short, why the humanities are still in love with books, why they retain a pace of publishing that is entirely snail-like compared both to other academic fields and to the rest of the world. Of course humanities scholars know as well as anyone that nothing is truly timeless and understand that trends and movements shape scholarship just like they shape fashion and music. But there is still a commitment to spend time to deliver something to the reader that is as polished and perfected as one can manage. Something that is not rushed, but refined. Why? Because the reader expects authority from a scholarly work and authority is derived from getting it right to the best of one&#8217;s ability.</p>
<p>This is not just a long-winded apology to the readers and contributors to this volume, although an apology for the considerable delay is surely in order, especially taking into account the considerable commitment and patience of our authors (thank you!). Our point is something equally important, something that connects to Roy Rosenzweig&#8217;s interpretation of scholarly publishing as a social contract. This publication contains eight papers produced to expand some of the talks held at the Berlin 6 Open Access Conference that took place in November 2008 in Düsseldorf, Germany. While Open Access has successfully moved forward in the past eighteen months and much has been achieved, none of the needs, views and fundamental aspects addressed in this volume &#8212; policy frameworks to enable it (Forster, Furlong), economic and organizational structures to make it viable and sustainable (Houghton; Gentil-Beccot, Mele, and Vigen), concrete platforms in different regions (Packer et al) and disciplines (Fritze, Dallmeier-Tiessen and Pfeiffenberger) to serve as models, and finally technical standards to support it (Zier) &#8212; none of these things have lost any of their relevance. </p>
<p>Open Access is a timely issue and therefore the discussion about it must be timely as well, but “discussion” in a highly interactive sense is hardly ever what a published volume provides anyway – that is something the blogosphere is already better at. That doesn&#8217;t mean that what scholars produce, be it in physics, computer science, law or history should be hallowed tomes that appear years after the controversies around the issues they cover have all but died down, to exist purely as historical documents. If that happens, scholarship itself has become a museal artifact that is obsolete, because a total lack of urgency will rightly suggest to people outside of universities that a field lacks relevance. If we don&#8217;t care when it&#8217;s published, how important can it be?</p>
<p>But can&#8217;t our publications be both timely and timeless at once? In other words, can we preserve the values cited by Roy Rosenzweig, not out of some antiquated fetish for scholarly works as perfect documents, but simply because thoroughly discussed, well-edited and proofed papers and books (and, for that matter, blog posts) are nicer to read and easier to understand than hastily produced ones? Readers don&#8217;t like it when their time is wasted; this is as true as ever in the age of information overload. Scientists are expected to get it right, to provide reliable insight and analysis. Better to be slow than to be wrong. In an attention economy, perfectionism pays a dividend of trust.</p>
<p>How does this relate to Open Access? If we look beyond the laws and policy initiatives and platforms for a moment, it seems exceedingly clear that access is ultimately a solvable issue and that we are fast approaching the point where it will be solved. This shift is unlikely to happen next month or next year, but if it hasn&#8217;t taken place a decade from now our potential to do innovative research will be seriously impaired and virtually all stakeholders know this. There is growing political pressure and commercial publishers are increasingly experimenting with products that generate revenue without limiting access. Historically, universities, libraries and publishers came into existence to solve the problem of access to knowledge (intellectual and physical access). This problem is arguably in the process of disappearing, and therefore it is of pivotal importance that all those involved in spreading knowledge work together to develop innovative approaches to digital scholarship, instead of clinging to eroding business models. As hard as it is for us to imagine, society may just find that both intellectual and physical access to knowledge are possible without us and that we&#8217;re a solution in search of a problem. The remaining barriers to access will gradually be washed away because of the pressure exerted not by lawmakers, librarians and (some) scholars who care about Open Access, but mainly by a general public that increasingly demands access to the research it finances. Openness is not just a technicality. It is a powerful meme that permeates all of contemporary society. </p>
<p>The ability for information to be openly available creates a pressure for it to be. Timeliness and timelessness are two sides of the same coin. In the competitive future of scholarly communication, those who get everything (mostly) right will succeed. Speedy and open publication of relevant, high quality content that is well adjusted to the medium and not just the reproduction of a paper artifact will trump those publications that do not meet all the requirements. The form and pace possible will be undercut by what is considered normal in individual academic disciplines and the conventions of one field will differ from those of another. Publishing less or at a slower pace is unlikely to be perceived as a fault in the long term, with all of us having long gone past the point of informational over-saturation. The ability to effectively make oneself heard (or read), paired with having something meaningful to say, will (hopefully) be of increasing importance, rather than just a high volume of output. </p>
<p>Much of the remaining resistance to Open Access is simply due to ignorance, and to murky premonitions of a new dark age caused by a loss of print culture. Ultimately, there will be a redefinition of the relativities between digital and print publication. There will be a place for both: the advent of mass literacy did not lead to the disappearance of the spoken word, so the advent of the digital age  is unlikely to lead to the disappearance of print culture. Transitory compromises such as delayed Open Access publishing are paving the way to fully-digital scholarship. Different approaches will be developed, and those who adapt quickly to a new pace and new tools will benefit, while those who do not will ultimately fall behind.</p>
<p>The ideological dimension of Open Access – whether knowledge should be free – seems strangely out of step with these developments. It is not unreasonable to assume that in the future, if it&#8217;s not accessible, it won&#8217;t be considered relevant. The logic of informational scarcity has ceased to make sense and we are still catching up with this fundamental shift.</p>
<p>Openness alone will not be enough. The traditional virtues of a publication – the extra 5% – are likely to remain unchanged in their importance while there is such a thing as institutional scholarship. We thank the authors of this volume for investing the extra 5% to enter a social contract with their readers, and another, considerably higher percentage for their immense patience with us. The result may not be entirely timely and, as has been outlined, nothing is ever truly timeless, but we strongly believe that its relevance is undiminished by the time that has passed.</p>
<p>Open Access, whether 2008 or 2010, remains a challenge – not just to lawmakers, librarians and technologists, but to us, to scholars. Some may rise to the challenge while others remain defiant, but ignorance seems exceedingly difficult to maintain. Now is a bad time to bury one&#8217;s head in the sand.</p>
<p>Düsseldorf,</p>
<p>May 2010</p>
<p>Cornelius Puschmann and Dieter Stein</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.ynada.com/242/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Visualizing text: theory and practice</title>
		<link>http://blog.ynada.com/235</link>
		<comments>http://blog.ynada.com/235#comments</comments>
		<pubDate>Wed, 19 May 2010 00:56:01 +0000</pubDate>
		<dc:creator>cornelius</dc:creator>
				<category><![CDATA[Thoughts]]></category>
		<category><![CDATA[methods]]></category>
		<category><![CDATA[text visualization]]></category>
		<category><![CDATA[THATcamp]]></category>

		<guid isPermaLink="false">http://blog.ynada.com/?p=235</guid>
		<description><![CDATA[Note: I&#8217;ve also posted this on thatcamp.org. Bad, bad me &#8212; of course I&#8217;ve been putting off writing up my ideas and thoughts for THATcamp almost to the latest possible moment. Waiting so long has one definitive advantage though: I get to point to some of the interesting suggestions that have already been posted here [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Note:</strong> I&#8217;ve also posted this on <a href="http://thatcamp.org/2010/visualizing-text/">thatcamp.org</a>.</p>
<p>Bad, bad me &#8212; of course I&#8217;ve been putting off writing up my ideas and thoughts for <a href="http://thatcamp.org/">THATcamp</a> almost until the last possible moment. Waiting so long has one definite advantage though: I get to point to some of the interesting suggestions that have already been posted here and (hopefully) add to them.</p>
<p>I&#8217;d like to both <em>discuss and do</em> text visualization. Charts, maps, infographics and other forms of visualization are becoming increasingly popular as we are faced with large quantities of textual data from a variety of sources. To linguists and literary scholars, visualizing texts can (among other things) be interesting to uncover things about language as such (corpus linguistics) and about individual texts and their authors (narratology, stylometrics, authorship attribution), while to a wide range of other disciplines the things that can be inferred from visualization (social change, spreading of cultural memes) beyond the text itself can be interesting.</p>
<p><strong>What can we potentially visualize?</strong> This may seem to be a naive question, but I believe that only by trying out virtually everything we can think of (distribution of letters, words, word classes, n-grams, paragraphs, &#8230;; patterning of narrative strands, structure of dialog, occurrence of specific rhetorical devices; references to places, people, points in time&#8230;; emotive expressions, abstract verbs, dream sequences&#8230; you name it) can we reach conclusions about what (if anything!) these things might mean.</p>
<p><strong>How can we visualize text?</strong> If we consider for a moment how we mostly visualize text today it quickly becomes apparent that there is much more we could be doing. Bar plots, line graphs and pie charts are largely instruments for quantification, yet very often quantitative relations between elements aren&#8217;t our only concern when studying text. <a href="http://wordle.net/">Word clouds</a> add plasticity, yet they eliminate the sequential patterning of a text and thus do not represent its rhetorical development from beginning to end. <a href="http://www.notcot.com/archives/2008/04/stefanie-posave.php">Trees and maps</a> are interesting in this regard, but by and large we hardly utilize the full potential of visualization as a form of analysis, for example by using lines, shapes, color (!) and beyond that, movement (video) in a way that suits the kind of data we are dealing with.</p>
<p><strong>What tools can we use to do visualization?</strong> I&#8217;m very interested in <a href="http://processing.org/">Processing</a> and have played with it, and more extensively with <a href="http://www.r-project.org/">R</a> and <a href="http://www.nltk.org/">NLTK/Python</a>. Tools for rendering data, such as <a href="http://code.google.com/intl/de/apis/charttools/">Google Chart Tools</a>, <a href="http://igraph.sourceforge.net/">igraph</a> and <a href="http://www.rgraph.net/">RGraph</a> are also interesting. Other, non-statistical tools are also an option: free hand drawing tools and web-based services like <a href="http://manyeyes.alphaworks.ibm.com/manyeyes/">Many Eyes</a>. Visualization doesn&#8217;t need to be restricted to computation/statistics. <a href="http://www.itsbeenreal.co.uk/">Stefanie Posavec</a>&#8217;s trees are a dynamic mix of automation and manual annotation and demonstrate that visualizations are rhetorically powerful interpretations themselves.</p>
<p>I hope that some of the abovementioned things connect to other THATcampers&#8217; ideas, e.g. <a href="http://thatcamp.org/2010/text-mining-scarce-sources/">Lincoln Mullen&#8217;s post on mining scarce sources</a> and <a href="http://thatcamp.org/2010/teaching-using-visualization/">Bill Ferster&#8217;s post on teaching using visualization</a>.</p>
<p>Don&#8217;t get me started on the potential for teaching. Ultimately translating a text into another form is a unique kind of critical engagement: you&#8217;re uncovering, interpreting and making an argument all at once, both to the text in question and to yourself.</p>
<p>Anyway &#8212; anything from discussing theoretical issues of visualization to sharing code snippets would fit into this session and I&#8217;m looking forward to hearing other campers&#8217; thoughts and experiences on the subject.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.ynada.com/235/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A linguist&#8217;s perspective on Creative Commons&#8217; data sharing whitepaper</title>
		<link>http://blog.ynada.com/228</link>
		<comments>http://blog.ynada.com/228#comments</comments>
		<pubDate>Tue, 04 May 2010 22:18:03 +0000</pubDate>
		<dc:creator>cornelius</dc:creator>
				<category><![CDATA[Thoughts]]></category>
		<category><![CDATA[creative commons]]></category>
		<category><![CDATA[data publishing]]></category>
		<category><![CDATA[open access]]></category>
		<category><![CDATA[open data]]></category>

		<guid isPermaLink="false">http://blog.ynada.com/?p=228</guid>
		<description><![CDATA[Edit: this post on (legal aspects of) data sharing by Creative Commons&#8217; Kaitlin Thaney is also highly recommended. Edit #2: This is another cross post with cyberling.org. If you&#8217;re involved in academic publishing &#8212; whether as a researcher, librarian or publisher &#8212; data sharing and data publishing are probably hot issues to you. Beyond its [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Edit:</strong> <a href="http://blogs.talis.com/nodalities/2010/02/sharing-data-on-the-web.php">this post</a> on (legal aspects of) data sharing by Creative Commons&#8217; Kaitlin Thaney is also highly recommended.</p>
<p><strong>Edit #2:</strong> This is another <a href="http://cyberling.org/node/31">cross post</a> with <a href="http://cyberling.org/">cyberling.org</a>.</p>
<p>If you&#8217;re involved in academic publishing &#8212; whether as a researcher, librarian or publisher &#8212; data sharing and data publishing are probably hot issues to you. Beyond its versatility as a platform for the dissemination of articles and ebooks, the Internet is increasingly also <a href="http://scientificdatasharing.com/">a place where research data lives</a>. Scholars are no longer restricted to referring to data in their publications or including charts and graphs alongside the text, but can link directly to data published and stored elsewhere, or even embed data into their papers, a process facilitated by standards such as the  <a href="http://www.w3.org/RDF/">Resource Description Framework (RDF)</a>.</p>
<p>Journals such as <a href="http://earth-system-science-data.net/">Earth System Science Data</a> and the <a href="http://www.ijrr.org/">International Journal of Robotics Research</a> give us a glimpse at how this approach might evolve in the future &#8212; from journals to data journals, publications which are concerned with presenting valuable data for reuse and pave the way for a research process that is increasingly collaborative. Technology is gradually catching up with the need for genuinely digital publications, a need fueled by the advantages of being able to combine text, images, links, videos and a wide variety of datasets to produce a next-generation multi-modal scholarly article. Systems such as <a href="http://www.fedora-commons.org/">Fedora</a> and <a href="http://colab.mpdl.mpg.de/mediawiki/PubMan">PubMan</a> are meant to facilitate digital publishing and assure best-practice data provenance and storage. They are able to handle different types of data and associate any number of individual files with a &#8220;data paper&#8221; that documents them.</p>
<p>However, technology is the much smaller issue when weighing the advantages of data publishing against its challenges &#8212; of which there are many, both to practitioners and to those supporting them. Best practices on the individual level are cultural norms that need to be established over time. Scientists still don&#8217;t have sufficient incentives to openly share their data, as tenure processes are tied to publishing results <em>based on</em> data, but not to <em>sharing data directly</em>. And finally, technology is prone to failure when there are no agreed-upon standards guiding its use, and such standards need to be gradually (meaning painfully slowly, compared with technology&#8217;s breakneck pace) established and accepted by scholars, not decreed by committee.</p>
<p>In March, <a href="http://sciencecommons.org/about/whoweare/rees/">Jonathan Rees</a> of <a href="http://neurocommons.org/page/Main_Page">NeuroCommons</a> (a  project within Creative Commons/Science Commons) published a <a href="http://neurocommons.org/report/data-publication.pdf">working  paper</a> that outlines such standards for reusable scholarly data. One  thing I really appreciate about Rees&#8217; approach is that it is remarkably  discipline-independent and not limited to the sciences (vs. social  science and the humanities).</p>
<p>Rees outlines how data papers differ from traditional papers:</p>
<blockquote><p>A data paper is a publication whose primary purpose is to <strong>expose and  describe data</strong>, as opposed to analyze and draw conclusions from it. The data paper enables <strong>a division of labor</strong> in which those possessing the resources and skills can perform the experiments and observations needed  to collect potentially interesting data sets, so that many parties, each with a unique  background and ability to analyze the data, may make use of it as they see fit.</p></blockquote>
<p>The key phrase here (which is why I couldn&#8217;t resist boldfacing it) is <em>division of labor</em>. Right now, to use an auto manufacturing analogy, a scholar does not just design a beautiful car (an analysis in the form of a research paper that culminates in observations or theoretical insights), she also has to build an engine (the data that her observations are based on). It doesn&#8217;t matter if she is a much better engineer than designer, the car will only run (she&#8217;ll only get tenure) if both the engine and the car meet the same requirements. The car analogy isn&#8217;t terribly fitting, but it serves to make the point that our current system lacks a division of labor, making it pretty inefficient. It&#8217;s based more on the idea of producing smart people than on the idea of getting smart people to produce reusable research.</p>
<p>Rees notes that data publishing is a complicated process and lists a set of rules for successful sharing of scientific data.</p>
<p>From the paper:</p>
<ol>
<li>The author must be professionally motivated to publish the data</li>
<li>The effort and economic burden of publication must be acceptable</li>
<li>The data must become accessible to potential users</li>
<li>The data must remain accessible over time</li>
<li>The data must be discoverable by potential users</li>
<li>The user’s use of the data must be permitted</li>
<li>The user must be able to understand what was measured and how (materials and methods)</li>
<li>The user must be able to understand all computations that were applied and their inputs</li>
<li>The user must be able to apply standard tools to all file formats</li>
</ol>
<p>At a glance, these rules signify very different things. #1 and #2 are preconditions, rather than prescriptions, while #3 &#8211; #6 are concerned with what the author needs to do in order to make the data available. Finally, rules #7 &#8211; #10 are concerned with making the data as useful to others as possible. Rules #7 &#8211; #10 are dependent on who &#8220;the user&#8221; is and qualify as &#8220;do-this-as-best-as-you-can&#8221;-style suggestions, rather than strict requirements, not because they aren&#8217;t important, but because it&#8217;s impossible for the author to guarantee their successful implementation. By contrast, #3 &#8211; #6 are concerned with providing and preserving access and are requirements &#8212; I can&#8217;t guarantee that you&#8217;ll understand (or agree with) my electronic dictionary on <a href="http://www.ethnologue.com/show_language.asp?code=khk">Halh Mongolian</a>, but I can make sure it&#8217;s stored in an institutional or disciplinary repository that is indexed in search engines, <a href="http://lockss.stanford.edu/lockss/Home">mirrored to assure the data can&#8217;t be lost</a> and licensed in a legally unambiguous way, rather than uploading it to my personal website and hoping for the best when it comes to long-term availability, ease of discovery and legal re-use.</p>
<p>Finally, Rees gives some good advice beyond tech issues to publishers who want to implement data publishing:</p>
<blockquote><p><strong>Set a standard.</strong> There won&#8217;t be investment in data set reusability unless granting agencies and tenure review boards see it as a legitimate activity. A journal that shows itself credible in the role of enabling reuse will be rewarded with submissions and citations, and will in turn reward authors by helping them obtain recognition for their service to the research community.</p></blockquote>
<p>This is critical. Don&#8217;t wait for universities, grant agencies or even scholars to agree on standards entirely on their own &#8212; they can&#8217;t and won&#8217;t if they don&#8217;t know how digital publishing works (legal aspects included). Start an innovative journal and set a standard yourself by being successful.</p>
<blockquote><p><strong>Encourage use of standard file formats, schemas, and ontologies.</strong> It is  impossible to know what file formats will be around in ten years, much  less a hundred, and this problem worries digital archivists. Open  standards such as XML, RDF/XML, and PNG should be encouraged. Plain text  is generally transparent but risky due to character encoding ambiguity.  File formats that are obviously new or exotic, that lack readily  available documentation, or that do not have non-proprietary parsers  should not be accepted. Ontologies and schemas should enjoy community  acceptance.</p></blockquote>
<p>An important suggestion that is entirely compatible with linguistic data (dictionaries, word lists, corpora, transcripts, etc) and simplified by the fact that we have comparably small datasets. Even a megaword corpus is small compared to climate data or gene banks.</p>
<blockquote><p><strong>Aggressively implement a clean separation of concerns</strong>. To encourage submissions and reduce the burden on authors and publishers, avoid the imposition of criteria not related to data reuse. These include importance (this will not be known until after others work with the data) and statistical strength (new methods and/or meta-analysis may provide it). The primary peer review criterion should be adequacy of experimental and computational methods description in the service of reuse.</p></blockquote>
<p>This will be a tough nut to crack, because it sheds tradition to a degree. Relevance was always high on the list of requirements while publications were scarce &#8212; paper costs money, therefore what was published had to be important to as many people as possible. With data publishing this is no longer true &#8212; whether something is important or statistically strong (applying this to linguistics one might say representative, well-documented, etc) is impossible to know from the outset. It&#8217;s much more sensible to get it out there and deal with the analysis later, rather than creating an artificial scarcity of data. But it will take time and cultural change to get researchers (and both funding agencies and hiring committees) to adapt to this approach.</p>
<p>In the meantime, while we&#8217;re still publishing traditional (non-data) papers, we can at least work on making them more accessible. Something like <a href="http://arxiv.org">arXiv</a> for linguistics wouldn&#8217;t hurt.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.ynada.com/228/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>danah boyd: &#8220;It’s great that we have Big Data but we need to develop the intellectual apparatus to actually analyze it.&#8221;</title>
		<link>http://blog.ynada.com/217</link>
		<comments>http://blog.ynada.com/217#comments</comments>
		<pubDate>Wed, 21 Apr 2010 13:31:01 +0000</pubDate>
		<dc:creator>cornelius</dc:creator>
				<category><![CDATA[Thoughts]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[danah boyd]]></category>
		<category><![CDATA[methods]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[social science]]></category>

		<guid isPermaLink="false">http://blog.ynada.com/?p=217</guid>
		<description><![CDATA[Thanks to Lambert for pointing out this highly recommended piece by danah boyd to me. I like it so much that I&#8217;ve decided to assemble some favorite quotes. On interpreting (big) quantitative social science data: &#8220;Just because you see traces of data doesn’t mean you always know the intention or cultural logic behind them. And [...]]]></description>
			<content:encoded><![CDATA[<p>Thanks to <a href="http://wikify.org/">Lambert</a> for pointing out <a href="http://www.zephoria.org/thoughts/archives/2010/04/17/big-data-opportunities-for-computational-and-social-sciences.html">this highly recommended piece </a>by <a href="http://www.zephoria.org/">danah boyd</a> to me. I like it so much that I&#8217;ve decided to assemble some favorite quotes.</p>
<p>On interpreting (big) quantitative social science data:</p>
<blockquote><p>&#8220;Just because you see <strong>traces of data</strong> doesn’t mean you always know the  <strong>intention</strong> or <strong>cultural logic</strong> behind them.  And just because you have a  big N doesn’t mean that it’s representative or generalizable.&#8221;</p></blockquote>
<blockquote><p>&#8220;Many computational scientists believe that because they have large N  data that they know more about people’s practices than any other social  scientist.  Time and time again, I see computational scientists <strong>mistake behavioral traces for cultural logic</strong>.&#8221;</p></blockquote>
<blockquote><p>&#8220;Big Data is going to be extremely important but we can never lose track  of <strong>the context in which this data is produced </strong>and the cultural logic  behind its production.&#8221;</p></blockquote>
<p>On interdisciplinarity and methods:</p>
<blockquote><p>&#8220;Each methodology has its strength and weaknesses.  Each approach to data  has its strengths and weaknesses. Each theoretical apparatus has its  place in scholarship.  And one of the biggest challenges in doing  “interdisciplinary” work is being about to account for these  differences, to know what approach works best for what question, to know  what theories speak to what data and can be used in which ways.&#8221;</p></blockquote>
<p>Which is why working in interdisciplinary teams where people really listen to each other is so important. Which is why learning beyond grad school is so important.</p>
<p>On funding agencies and interdisciplinarity:</p>
<blockquote><p>&#8220;I actually think that the funding agencies are going to play a huge role in this, not just in demanding cross-disciplinary collaboration, but in setting the stage for how research will be published.&#8221;</p></blockquote>
<p>This is an important point &#8212; and one where I wonder whether the situation over here in Germany isn&#8217;t more difficult than in the U.S. Funding agencies over here are incredibly reluctant to make demands of researchers. This has both upsides and downsides, a downside being that there are fewer incentives to cooperate.</p>
<p>On social scientists and computational scientists joining forces to approach Big Data:</p>
<blockquote><p>&#8220;[..]every discipline has its arrogance and far too many scholars think that they know everything. We desperately need a little humility here.&#8221;</p></blockquote>
<p>Amen. And, interestingly enough, I sense a connection between danah&#8217;s argument and <a href="http://gunnarsohn.wordpress.com/2010/01/31/netznavigator-herder-statt-schirrmacher/">Frank Schirrmacher&#8217;s views</a>:</p>
<blockquote><p>Computer scientists must be brought out of their niches and into the middle of society. They must explain the scripts according to which we act and are judged. What is predictive search and what can it do? What is &#8220;profiling&#8221;? Who reads us while we read? Technologies are neutral; what matters is how we use them. To be able to do that, we need interpreters from the technological intelligentsia.</p></blockquote>
<p>Interestingly enough, danah is the one who&#8217;s more critical. Schirrmacher (who isn&#8217;t talking about Big Data, but about digital technology in general and about its impact on society) demands that computational scientists explain their code to the public &#8212; what ranking algorithms do and how context-sensitive ads work. danah criticizes drawing conclusions from automated computational analysis without taking other methods into account. If we start out with simplistic assumptions (e.g. &#8220;the people we spend the most time with are the ones closest to us&#8221;) we are prone to drawing entirely wrong conclusions, even if our data is beautifully modeled.</p>
<p>I could go on and on about why danah is spot-on, but instead I&#8217;ll just point <a href="http://www.zephoria.org/thoughts/archives/2010/04/17/big-data-opportunities-for-computational-and-social-sciences.html">to the piece itself</a> again.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.ynada.com/217/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
