<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cornelius Puschmann&#039;s Blog &#187; statistics</title>
	<atom:link href="http://blog.ynada.com/tag/statistics/feed" rel="self" type="application/rss+xml" />
	<link>http://blog.ynada.com</link>
	<description>My new blog on Linguistics, Digital Humanities and Scholarly Communication on the Internet</description>
	<lastBuildDate>Wed, 18 Jan 2012 17:54:42 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Post-event Twitter stats for #THATcamp</title>
		<link>http://blog.ynada.com/265</link>
		<comments>http://blog.ynada.com/265#comments</comments>
		<pubDate>Wed, 26 May 2010 16:08:17 +0000</pubDate>
		<dc:creator>cornelius</dc:creator>
				<category><![CDATA[data]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[THATcamp]]></category>
		<category><![CDATA[twitter]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://blog.ynada.com/?p=265</guid>
		<description><![CDATA[I thought I&#8217;d post an updated version of the simple stats on Twitter activity presented here. The data in the older post was collected before THATcamp took place; the graphs below show the activity during and after the camp. The tweets I&#8217;ve collected are also available here (my own file) and on TwapperKeeper. Tweets over [...]]]></description>
			<content:encoded><![CDATA[<p>I thought I&#8217;d post an updated version of the simple stats on Twitter activity presented <a href="http://thatcamp.org/2010/twitterstat/">here</a>. The data in the older post was collected before <a href="http://thatcamp.org">THATcamp</a> took place; the graphs below show the activity during and after the camp.</p>
<p>The tweets I&#8217;ve collected are also available <a href='http://blog.ynada.com/wp-content/uploads/2010/05/thatcamp-all.zip'>here</a> (my own file) and on <a href="http://www.twapperkeeper.com/hashtag/thatcamp">TwapperKeeper</a>.</p>
<p style="clear: both"></p>
<p><strong>Tweets over time (roughly 14th of May to 24th)</strong></p>
<p><a href="http://blog.ynada.com/wp-content/uploads/2010/05/tc-time2.png"><img src="http://blog.ynada.com/wp-content/uploads/2010/05/tc-time2-300x187.png" alt="" title="tc-time2" width="300" height="187" class="aligncenter size-medium wp-image-276" /></a></p>
<p style="clear: both"></p>
<p><strong>Most active users</strong></p>
<p><a href="http://blog.ynada.com/wp-content/uploads/2010/05/tc-activity.png"><img src="http://blog.ynada.com/wp-content/uploads/2010/05/tc-activity-300x175.png" alt="" title="tc-activity" width="300" height="175" class="aligncenter size-medium wp-image-266" /></a></p>
<p style="clear: both"></p>
<p><strong>Most @-messaged users</strong></p>
<p><a href="http://blog.ynada.com/wp-content/uploads/2010/05/tc-ats.png"><img src="http://blog.ynada.com/wp-content/uploads/2010/05/tc-ats-300x175.png" alt="" title="tc-ats" width="300" height="175" class="aligncenter size-medium wp-image-267" /></a></p>
<p style="clear: both"></p>
<p><strong>Most retweeted users</strong></p>
<p><a href="http://blog.ynada.com/wp-content/uploads/2010/05/tc-rts.png"><img src="http://blog.ynada.com/wp-content/uploads/2010/05/tc-rts-300x175.png" alt="" title="tc-rts" width="300" height="175" class="aligncenter size-medium wp-image-268" /></a></p>
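<p>For anyone who wants to recompute counts like these from their own tweet archive, here is a minimal sketch. It is not the script behind the graphs above. The <code>(user, text)</code> input format, the sample tweets and the helper name <code>twitter_stats</code> are assumptions made for illustration, and retweets are recognized only by the surface pattern <code>RT @user</code>.</p>

```python
import re
from collections import Counter

# Hypothetical sample data; the real archive's format is not described in this post.
tweets = [
    ("alice", "Looking forward to #THATcamp"),
    ("bob", "@alice see you there #THATcamp"),
    ("carol", "RT @alice: Looking forward to #THATcamp"),
    ("alice", "@bob yes! #THATcamp"),
]

MENTION = re.compile(r"@(\w+)")
RETWEET = re.compile(r"\bRT @(\w+)")


def twitter_stats(tweets):
    """Count tweets per author, @-mentions received, and retweets received."""
    active = Counter(user for user, _ in tweets)
    mentioned = Counter()
    retweeted = Counter()
    for _, text in tweets:
        retweeted.update(RETWEET.findall(text))
        # Strip "RT @user" markers first so a retweet is not also counted as a mention.
        remaining = RETWEET.sub("", text)
        mentioned.update(MENTION.findall(remaining))
    return active, mentioned, retweeted


active, mentioned, retweeted = twitter_stats(tweets)
print("most active:", active.most_common(3))
print("most @-messaged:", mentioned.most_common(3))
print("most retweeted:", retweeted.most_common(3))
```

<p>Whether a retweet should also count as an @-message is a design choice; this sketch keeps the two separate.</p>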
]]></content:encoded>
			<wfw:commentRss>http://blog.ynada.com/265/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>A first glimpse at my Twitter corpus</title>
		<link>http://blog.ynada.com/106</link>
		<comments>http://blog.ynada.com/106#comments</comments>
		<pubDate>Tue, 14 Jul 2009 13:57:04 +0000</pubDate>
		<dc:creator>cornelius</dc:creator>
				<category><![CDATA[Thoughts]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://blog.ynada.com/?p=106</guid>
		<description><![CDATA[I&#8217;ve finally found the time to do some initial number-crunching on my Twitter corpus in preparation for my presentation at IR 10.0. In this post, I&#8217;ll document some very (!) basic first observations, all of which are work in progress, but will probably show up in a published paper at some point. The to-date research [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve finally found the time to do some initial number-crunching on my Twitter corpus in preparation for my presentation at IR 10.0. In this post, I&#8217;ll document some very (!) basic first observations, all of which are work in progress, but will probably show up in a published paper at some point.</p>
<p>The existing research by <a href="http://ella.slis.indiana.edu/~herring/honeycutt.herring.2009.pdf">Honeycutt/Herring</a> and <a href="http://www.danah.org/papers/TweetTweetRetweet.pdf">boyd/Golder/Lotan</a> has already examined different aspects of Twitter language thoroughly (retweeting, @-messaging), but I hope to add insight by examining facets that have been explored less so far, namely the relation of tweeting to other forms of <a href="http://en.wikipedia.org/wiki/Computer-mediated_communication">CMC</a>, and uses of Twitter which don&#8217;t involve retweeting and messaging but can be regarded as more introspective and (to me) &#8220;blog-like&#8221;.</p>
<p>I&#8217;ll start with some very basic stuff that won&#8217;t really be too surprising to most of you.</p>
<p>First the corpus data:</p>
<p>a) extract from my larger Twitter corpus (Twitter_SmallCorp)<br />
Size: 1,932,772 tokens / 149,292 types</p>
<p>b) <a href="http://faculty.nps.edu/cmartell/NPSChat.htm">NPS Chat corpus</a><br />
Size: 45,010 tokens / 6,066 types</p>
<p>c) Webtext corpus included in the <a href="http://www.nltk.org/Home">NLTK</a><br />
Size: 396,736 tokens / 21,537 types</p>
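<p>To make the size figures a little easier to compare, here is a small sketch that computes the type/token ratio for each corpus from the counts listed above. The ratio is an addition for illustration and is not used elsewhere in this post. Raw type/token ratios also depend heavily on corpus size, so they should be read with caution for corpora as different in size as these.</p>

```python
# Token and type counts as reported in the post.
corpora = {
    "Twitter_SmallCorp": (1_932_772, 149_292),
    "NPS_Chat": (45_010, 6_066),
    "Webtext": (396_736, 21_537),
}


def type_token_ratio(tokens, types):
    """Return the number of distinct types per running token."""
    return types / tokens


ratios = {name: type_token_ratio(*counts) for name, counts in corpora.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.4f}")
```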
<p>My three corpora differ drastically in size, and by the standards of most computational linguists all three are pretty small. It&#8217;s my impression, however, that top X word lists (e.g. top 50) will not change significantly once a corpus grows beyond six-digit token counts, but tend to show characteristic distribution patterns for a given genre. Word lists are too ambiguous to identify a type of text reliably, but looking at them can still be interesting.</p>
<p>The table below shows the top 20 types in all three corpora ranked by frequency. I deliberately don&#8217;t provide raw counts in the table, as they would be fairly meaningless without normalization given the size differences &#8211; the rank is what I&#8217;m interested in.</p>
<table border="1" cellspacing="4" cellpadding="4" frame="void" rules="none">
<tbody>
<tr>
<td height="17" align="left"></td>
<td align="left">Twitter_SmallCorp</td>
<td align="left">NPS_Chat</td>
<td align="left">Webtext</td>
</tr>
<tr>
<td height="17" align="left" bgcolor="#eeeeee">1</td>
<td align="left" bgcolor="#eeeeee">I</td>
<td align="left" bgcolor="#eeeeee">lol</td>
<td align="left" bgcolor="#eeeeee">I</td>
</tr>
<tr>
<td height="17" align="left">2</td>
<td align="left">the</td>
<td align="left">to</td>
<td align="left">the</td>
</tr>
<tr>
<td height="17" align="left" bgcolor="#eeeeee">3</td>
<td align="left" bgcolor="#eeeeee">to</td>
<td align="left" bgcolor="#eeeeee">i</td>
<td align="left" bgcolor="#eeeeee">to</td>
</tr>
<tr>
<td height="17" align="left">4</td>
<td align="left">a</td>
<td align="left">the</td>
<td align="left">a</td>
</tr>
<tr>
<td height="17" align="left" bgcolor="#eeeeee">5</td>
<td align="left" bgcolor="#eeeeee">and</td>
<td align="left" bgcolor="#eeeeee">you</td>
<td align="left" bgcolor="#eeeeee">you</td>
</tr>
<tr>
<td height="17" align="left">6</td>
<td align="left">of</td>
<td align="left">I</td>
<td align="left">in</td>
</tr>
<tr>
<td height="17" align="left" bgcolor="#eeeeee">7</td>
<td align="left" bgcolor="#eeeeee">for</td>
<td align="left" bgcolor="#eeeeee">a</td>
<td align="left" bgcolor="#eeeeee">and</td>
</tr>
<tr>
<td height="17" align="left">8</td>
<td align="left">in</td>
<td align="left">hi</td>
<td align="left">on</td>
</tr>
<tr>
<td height="17" align="left" bgcolor="#eeeeee">9</td>
<td align="left" bgcolor="#eeeeee">you</td>
<td align="left" bgcolor="#eeeeee">me</td>
<td align="left" bgcolor="#eeeeee">of</td>
</tr>
<tr>
<td height="17" align="left">10</td>
<td align="left">is</td>
<td align="left">is</td>
<td align="left">is</td>
</tr>
<tr>
<td height="17" align="left" bgcolor="#eeeeee">11</td>
<td align="left" bgcolor="#eeeeee">it</td>
<td align="left" bgcolor="#eeeeee">in</td>
<td align="left" bgcolor="#eeeeee">it</td>
</tr>
<tr>
<td height="17" align="left">12</td>
<td align="left">s</td>
<td align="left">and</td>
<td align="left">not</td>
</tr>
<tr>
<td height="17" align="left" bgcolor="#eeeeee">13</td>
<td align="left" bgcolor="#eeeeee">on</td>
<td align="left" bgcolor="#eeeeee">it</td>
<td align="left" bgcolor="#eeeeee">that</td>
</tr>
<tr>
<td height="17" align="left">14</td>
<td align="left">my</td>
<td align="left">that</td>
<td align="left">with</td>
</tr>
<tr>
<td height="17" align="left" bgcolor="#eeeeee">15</td>
<td align="left" bgcolor="#eeeeee">that</td>
<td align="left" bgcolor="#eeeeee">hey</td>
<td align="left" bgcolor="#eeeeee">for</td>
</tr>
<tr>
<td height="17" align="left">16</td>
<td align="left">n&#8217;t</td>
<td align="left">my</td>
<td align="left">Girl</td>
</tr>
<tr>
<td height="17" align="left" bgcolor="#eeeeee">17</td>
<td align="left" bgcolor="#eeeeee">have</td>
<td align="left" bgcolor="#eeeeee">of</td>
<td align="left" bgcolor="#eeeeee">2</td>
</tr>
<tr>
<td height="17" align="left">18</td>
<td align="left">with</td>
<td align="left">u</td>
<td align="left">Guy</td>
</tr>
<tr>
<td height="17" align="left" bgcolor="#eeeeee">19</td>
<td align="left" bgcolor="#eeeeee">at</td>
<td align="left" bgcolor="#eeeeee">s</td>
<td align="left" bgcolor="#eeeeee">when</td>
</tr>
<tr>
<td height="17" align="left">20</td>
<td align="left">me</td>
<td align="left">for</td>
<td align="left">like</td>
</tr>
</tbody>
</table>
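<p>A ranked list like the one in the table can be produced along the following lines. This is a minimal sketch using only the Python standard library, not the script actually used for the table. The function name <code>top_types</code> and the toy sentence are made up for illustration. Counting is case-sensitive, which matches the table (the chat column lists both <em>i</em> and <em>I</em>).</p>

```python
from collections import Counter


def top_types(tokens, n=20):
    """Return the n most frequent types as (rank, type) pairs, with ranks starting at 1."""
    counts = Counter(tokens)  # case-sensitive: "I" and "i" stay separate types
    ranked = counts.most_common(n)
    return [(rank, word) for rank, (word, _) in enumerate(ranked, start=1)]


# Hypothetical toy input; a real run would pass the tokenized corpus instead.
sample = "I think I like the idea of the post and I like it".split()
for rank, word in top_types(sample, n=3):
    print(rank, word)
```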
<p>There&#8217;s a lot of uniformity at first glance if you compare the three lists. However, the comparison of Twitter with Web chat shows some interesting (though largely unsurprising) differences:</p>
<ol>
<li>the first person pronoun and determiner (<em>I, me, my</em>) are used frequently in all three corpora, but Twitter seems to have a slight lead</li>
<li>the second person (<em>you</em>) is less frequent in Twitter than in the other corpora</li>
<li>greetings and emotives (<em>hi, hey, lol</em>) are frequent in chat, but occur much less frequently in Twitter</li>
<li>words expressing relations (<em>and, of</em>) rank markedly higher in Twitter than in chat</li>
</ol>
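<p>The observations in the list can be checked against the table with a simple rank lookup. The sketch below copies the three top-20 columns from the table; the dictionary <code>top20</code> and the helper <code>rank_of</code> are names introduced here for illustration. A result of <code>None</code> only means that a word falls outside the top 20 shown, not that it is absent from the corpus.</p>

```python
# Top-20 lists copied from the table in this post ("n't" is the clitic token).
top20 = {
    "Twitter_SmallCorp": ["I", "the", "to", "a", "and", "of", "for", "in", "you", "is",
                          "it", "s", "on", "my", "that", "n't", "have", "with", "at", "me"],
    "NPS_Chat": ["lol", "to", "i", "the", "you", "I", "a", "hi", "me", "is",
                 "in", "and", "it", "that", "hey", "my", "of", "u", "s", "for"],
    "Webtext": ["I", "the", "to", "a", "you", "in", "and", "on", "of", "is",
                "it", "not", "that", "with", "for", "Girl", "2", "Guy", "when", "like"],
}


def rank_of(word, corpus):
    """Return the 1-based rank of word in the given list, or None if it is not in the top 20."""
    words = top20[corpus]
    return words.index(word) + 1 if word in words else None


for word in ["you", "and", "of", "lol", "hi", "hey"]:
    print(word, {corpus: rank_of(word, corpus) for corpus in top20})
```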
<p>In contrast to chat, Twitter is generally used in a (more) asynchronous fashion, which provides motivation for (1) &#8211; (4). Going hand in hand with this is the lack of cospatiality (<em>virtual</em> cospatiality, that is &#8211; obviously there&#8217;s generally no <em>real</em> cospatiality online): Twitter does not evoke the image of a &#8220;room&#8221; or shared space as most chats do. Depending on my Twitter client, I can only see what the people I follow are writing, but not my own tweets and direct messages, at least not in the same window. Finally, Twitter&#8217;s participant structure is open and opaque: in a chat I may not know the participants personally, but I can identify them individually. On Twitter, unless my updates are protected, anyone <em>can</em> potentially read my tweets, and at the same time it is less obvious that anyone <em>will</em> actually read them. This situation is comparable to that of blogs, and it explains the lower degree of linguistically enacted performance in Twitter vs. chats and the higher degree of propositional language. Everyone controls his/her own discourse environment in Twitter, and accordingly fewer expressions are used that relate the speaker to others (2 and 3), while more are used that refer to the Twitterer (1).</p>
<p>Below are three plots showing the cumulative type distributions in each corpus. Note that they are rough and contain noise and punctuation.</p>

<a href='http://blog.ynada.com/106/twitter_tsmallcorp_top50cumulativepng' title='Top 50 types in Twitter_SmallCorp'><img width="150" height="150" src="http://blog.ynada.com/wp-content/uploads/2009/07/twitter_tsmallcorp_top50cumulativepng-150x150.png" class="attachment-thumbnail" alt="Top 50 types in Twitter_SmallCorp" title="Top 50 types in Twitter_SmallCorp" /></a>
<a href='http://blog.ynada.com/106/chat_nps_top50cumulativepng' title='Top 50 types in the NPS chat corpus'><img width="150" height="150" src="http://blog.ynada.com/wp-content/uploads/2009/07/chat_nps_top50cumulativepng-150x150.png" class="attachment-thumbnail" alt="Top 50 types in the NPS chat corpus" title="Top 50 types in the NPS chat corpus" /></a>
<a href='http://blog.ynada.com/106/web_webtext_top50cumulativepng' title='Top 50 types in the Webtext corpus'><img width="150" height="150" src="http://blog.ynada.com/wp-content/uploads/2009/07/web_webtext_top50cumulativepng-150x150.png" class="attachment-thumbnail" alt="Top 50 types in the Webtext corpus" title="Top 50 types in the Webtext corpus" /></a>
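<p>One plausible way to compute the numbers behind plots like these is the running share of all tokens covered by the most frequent types. The sketch below uses only the standard library, with a made-up function name and toy input, and is not necessarily how the plots above were generated. NLTK&#8217;s <code>FreqDist.plot(50, cumulative=True)</code> draws a similar kind of curve from raw counts.</p>

```python
from collections import Counter
from itertools import accumulate


def cumulative_coverage(tokens, n=50):
    """Return the running share of all tokens covered by the 1..n most frequent types."""
    counts = Counter(tokens)
    total = sum(counts.values())
    top_counts = [count for _, count in counts.most_common(n)]
    return [running / total for running in accumulate(top_counts)]


# Hypothetical toy input; a real run would pass the tokenized corpus instead.
sample = "a a a a b b b c c d".split()
print(cumulative_coverage(sample, n=3))
```

<p>Filtering out punctuation tokens before counting would remove some of the noise mentioned above.</p>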

<p>I&#8217;m still just scratching the surface here, but a comparison of verbs and verb classes using larger Twitter, blog and chat corpora will come next. I&#8217;ll also look at tweets on the discourse level, specifically at (for lack of a better word) &#8220;non-communicative tweets&#8221;, i.e. those which are neither RTs nor @-messages. Stay tuned. <img src='http://blog.ynada.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
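<p>A first, rough filter for such tweets could look like the sketch below. The surface conventions are assumptions: a tweet counts as a retweet if it contains <code>RT @user</code>, and as an @-message if it starts with <code>@user</code>. The function name and the example tweets are hypothetical, and other retweet styles (e.g. &#8220;via @user&#8221;) are not covered.</p>

```python
import re

# Assumed conventions: "RT @user" marks a retweet, a leading "@user" marks an @-message.
RETWEET = re.compile(r"\bRT\s*@\w+")
AT_MESSAGE = re.compile(r"^\s*@\w+")


def is_non_communicative(text):
    """Return True if the tweet is neither a retweet nor an @-message under the assumed conventions."""
    return not (RETWEET.search(text) or AT_MESSAGE.search(text))


# Hypothetical example tweets.
tweets = [
    "Finally finished the draft.",
    "@bob thanks for the pointer",
    "RT @alice: new post on Twitter corpora",
]
print([text for text in tweets if is_non_communicative(text)])
```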
]]></content:encoded>
			<wfw:commentRss>http://blog.ynada.com/106/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

