Jul 14

I’ve finally found the time to do some initial number-crunching on my Twitter corpus in preparation for my presentation at IR 10.0. In this post, I’ll document some very (!) basic first observations, all of which are work in progress, but will probably show up in a published paper at some point.

The research to date by Honeycutt/Herring and boyd/Golder/Lotan has already examined some aspects of Twitter language thoroughly (retweeting, @-messaging). I hope to add insight in a few areas that have been explored less so far: the relation of tweeting to other forms of CMC, and uses of Twitter that don’t involve retweeting and messaging but can be regarded as more introspective and (to me) “blog-like”.

I’ll start with some very basic stuff that won’t really be too surprising to most of you.

First the corpus data:

a) extract from my larger Twitter corpus (Twitter_SmallCorp)
Size: 1,932,772 tokens / 149,292 types

b) NPS Chat corpus
Size: 45,010 tokens / 6,066 types

c) Webtext corpus included in the NLTK
Size: 396,736 tokens / 21,537 types
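
Token and type counts of this kind can be computed along the following lines. This is only a minimal sketch: the regular-expression tokenizer and the names `tokenize` and `corpus_size` are assumptions for illustration, types are counted case-sensitively, and a different tokenizer (for instance one from the NLTK) will give somewhat different numbers.

```python
import re
from collections import Counter

# Assumption: a crude tokenizer that keeps word-internal apostrophes together
# and treats every other non-space character as its own token.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def tokenize(text):
    """Split a string into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def corpus_size(texts):
    """Return (number of tokens, number of distinct types) for an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return sum(counts.values()), len(counts)


# Toy input standing in for a real corpus.
sample = ["I am at home", "lol I am so tired"]
n_tokens, n_types = corpus_size(sample)
print(f"Size: {n_tokens:,} tokens / {n_types:,} types")
```

Because punctuation marks count as tokens here, the totals include noise unless it is filtered out beforehand.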

My three corpora differ drastically in size, and by the standards of most computational linguists all three are pretty small. It’s my impression, however, that top-X word lists (e.g. the top 50) don’t change significantly once a corpus grows beyond six-digit token counts, and that they tend to show characteristic distribution patterns for a given genre. Word lists are too ambiguous to identify a type of text reliably, but looking at them can still be interesting.

The table below shows the top 20 types in each of the three corpora, ranked by frequency. I deliberately don’t provide unnormalized counts in the table, as they would be fairly meaningless given the different corpus sizes; the rank is what I’m interested in.

Rank  Twitter_SmallCorp  NPS_Chat  Webtext
1     I                  lol       I
2     the                to        the
3     to                 i         to
4     a                  the       a
5     and                you       you
6     of                 I         in
7     for                a         and
8     in                 hi        on
9     you                me        of
10    is                 is        is
11    it                 in        it
12    s                  and       not
13    on                 it        that
14    my                 that      with
15    that               hey       for
16    n’t                my        Girl
17    have               of        2
18    with               u         Guy
19    at                 s         when
20    me                 for       like
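
A side-by-side rank table like the one above can be produced roughly as follows. This is a sketch under assumptions: each corpus is already tokenized into a list of strings, and `top_types` and `rank_table` are illustrative names rather than the code actually used for this post.

```python
from collections import Counter
from itertools import zip_longest


def top_types(tokens, n=20):
    """Return the n most frequent types, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(n)]


def rank_table(corpora, n=20):
    """Build rows of (rank, type in corpus 1, type in corpus 2, ...).

    corpora maps a corpus name to its list of tokens; corpora with fewer
    than n types are padded with empty strings.
    """
    columns = [top_types(tokens, n) for tokens in corpora.values()]
    rows = []
    for rank, words in enumerate(zip_longest(*columns, fillvalue=""), start=1):
        rows.append((rank, *words))
    return rows


# Toy corpora standing in for the real ones.
toy = {
    "corpus_a": "I went to the shop and I got the milk I wanted".split(),
    "corpus_b": "lol hi lol you there lol hi".split(),
}
print("Rank", *toy, sep="\t")
for row in rank_table(toy, n=3):
    print(*row, sep="\t")
```

Only ranks are printed, which keeps corpora of very different sizes comparable without normalizing the counts first.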

There’s a lot of uniformity at first glance if you compare the three lists. However, the comparison of Twitter with Web chat shows some interesting (though largely unsurprising) differences:

  1. the first person pronoun and determiner (I, me, my) are used frequently in all three corpora, but Twitter seems to have a slight lead
  2. the second person (you) ranks lower in Twitter (9th) than in the other two corpora (5th in both)
  3. greetings and emotives (hi, hey, lol) are frequent in chat (lol ranks 1st, hi 8th, hey 15th), but none of them appears among the Twitter types listed above
  4. words expressing relations (and, of) rank clearly higher in Twitter (5th and 6th) than in chat (12th and 17th)
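
Rank comparisons like those in (1) to (4) can be checked by looking up the rank of individual words in each corpus. The sketch below assumes pre-tokenized corpora; `ranks`, `compare_ranks` and the optional `fold_case` switch are hypothetical helpers. Case folding is worth considering because the chat column above lists "i" and "I" as separate types.

```python
from collections import Counter


def ranks(tokens, fold_case=False):
    """Map each type to its 1-based frequency rank (1 = most frequent)."""
    if fold_case:
        tokens = [t.lower() for t in tokens]
    ordered = Counter(tokens).most_common()
    return {word: i for i, (word, _) in enumerate(ordered, start=1)}


def compare_ranks(corpora, words, fold_case=False):
    """For each word, give its rank in every corpus (None if it never occurs)."""
    tables = {name: ranks(toks, fold_case) for name, toks in corpora.items()}
    return {w: {name: tables[name].get(w) for name in corpora} for w in words}


# Toy corpora standing in for the real ones.
toy = {
    "tweets": "I think I like the idea of you and me".split(),
    "chat": "lol you there lol hi you lol".split(),
}
for word, by_corpus in compare_ranks(toy, ["you", "lol"]).items():
    print(word, by_corpus)
```

A word that is missing from a corpus comes back as `None`, which distinguishes "never occurs" from "occurs but ranks low".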

In contrast to chat, Twitter is generally used in a (more) asynchronous fashion, which provides motivation for (1) to (4). Going hand in hand with this is the lack of cospatiality (virtual cospatiality, that is; obviously there’s generally no real cospatiality online): Twitter does not evoke the image of a “room” or shared space as most chats do. Depending on my Twitter client, I can only see what the people I follow are writing, but not my own tweets and direct messages, at least not in the same window. Finally, the participant structure is open and opaque. I may not know the participants in a chat personally, but I can identify them individually. On Twitter, unless my updates are protected, anyone can potentially read my tweets, and at the same time it is less obvious that anyone will actually read them. This situation is comparable to that of blogs, and it explains the lesser degree of linguistically enacted performance in Twitter vs. chats and the higher degree of propositional language. Everyone controls his/her own discourse environment in Twitter, and accordingly fewer expressions are used that relate the speaker to others (2 and 3), while more are used that include the Twitterer (1).

Below are three plots showing the cumulative type distributions in each corpus. Note that they are rough and contain noise and punctuation.
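
For readers without the plots, a cumulative distribution of this kind can be approximated numerically. The sketch below is one possible reading of "cumulative type distribution", namely the share of all tokens covered by the k most frequent types; `cumulative_coverage` is an illustrative name, and the original plots may have been drawn differently (the NLTK's `FreqDist.plot` with `cumulative=True` gives a comparable curve based on raw counts).

```python
from collections import Counter
from itertools import accumulate


def cumulative_coverage(tokens):
    """Share of all tokens covered by the k most frequent types, for k = 1, 2, ..."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    total = sum(counts)
    return [running / total for running in accumulate(counts)]


# Toy token list standing in for a real corpus.
toy = "a a a a b b b c c d".split()
coverage = cumulative_coverage(toy)
for k, share in enumerate(coverage, start=1):
    print(f"top {k} types cover {share:.0%} of tokens")
```

A curve that rises steeply and then flattens means that a handful of very frequent types account for most of the running text.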

I’m still just scratching the surface here, but a comparison of verbs and verb classes using larger Twitter, blog and chat corpora will come next. I’ll also look at tweets on the discourse level, specifically at (for lack of a better word) “non-communicative tweets”, i.e. those which are not RTs and not @-messages. Stay tuned. :-)
