Ahead of publishing my TwitterFunctions library of R code (which is a constant work in progress), I thought I’d put up some really short Python code for getting a person’s friends and followers. Both scripts rely on Tweepy, my favorite Python library for the Twitter API. Install Python (it works on Windows as well, not just on Mac/Linux) and then Tweepy on top of that, and you are good to go with these two scripts, which can be executed from the command line with
python get_friends.py username

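# Print the screen name of every account the given user follows.
import sys
import tweepy

# The Twitter user name is the first command-line argument.
user = sys.argv[1]
for friend in tweepy.api.friends(user):
    print(friend.screen_name)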
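# Print the screen name of every account that follows the given user.
import sys
import tweepy

# The Twitter user name is the first command-line argument.
user = sys.argv[1]
for follower in tweepy.api.followers(user):
    print(follower.screen_name)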

Visualizing text: theory and practice

On May 19, 2010, in Thoughts, by cornelius

Note: I’ve also posted this on thatcamp.org.

Bad, bad me: of course I’ve been putting off writing up my ideas and thoughts for THATcamp almost to the last possible moment. Waiting so long has one definite advantage, though: I get to point to some of the interesting suggestions that have already been posted here and (hopefully) add to them.

I’d like to both discuss and do text visualization. Charts, maps, infographics and other forms of visualization are becoming increasingly popular as we are faced with large quantities of textual data from a variety of sources. For linguists and literary scholars, visualizing texts can (among other things) uncover things about language as such (corpus linguistics) and about individual texts and their authors (narratology, stylometrics, authorship attribution). For a wide range of other disciplines, the interest lies in what visualization lets us infer beyond the text itself (social change, the spread of cultural memes).

What can we potentially visualize? This may seem to be a naive question, but I believe that only by trying out virtually everything we can think of (distribution of letters, words, word classes, n-grams, paragraphs, …; patterning of narrative strands, structure of dialog, occurrence of specific rhetorical devices; references to places, people, points in time…; emotive expressions, abstract verbs, dream sequences… you name it) can we reach conclusions about what (if anything!) these things might mean.
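Counting is the first step toward most of these views. Here is a minimal sketch in plain Python of tallying letters, words and n-grams before plotting anything. The sample sentence, the simple regular-expression tokenizer and the helper names `tokenize` and `ngrams` are assumptions for illustration; a real corpus would call for a proper tokenizer such as the ones in NLTK.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return all n-grams (as tuples) in order of occurrence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sample text, used only to show the shape of the results.
sample = "the cat sat on the mat and the cat slept"
tokens = tokenize(sample)

letter_counts = Counter(ch for ch in sample if ch.isalpha())
word_counts = Counter(tokens)
bigram_counts = Counter(ngrams(tokens, 2))

print(word_counts.most_common(2))    # [('the', 3), ('cat', 2)]
print(bigram_counts.most_common(1))  # [(('the', 'cat'), 2)]
```

Each `Counter` can be handed to whatever plotting tool you prefer. The counting itself stays the same whether the unit is a letter, a word or a word class.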

How can we visualize text? If we consider for a moment how we mostly visualize text today it quickly becomes apparent that there is much more we could be doing. Bar plots, line graphs and pie charts are largely instruments for quantification, yet very often quantitative relations between elements aren’t our only concern when studying text. Word clouds add plasticity, yet they eliminate the sequential patterning of a text and thus do not represent its rhetorical development from beginning to end. Trees and maps are interesting in this regard, but by and large we hardly utilize the full potential of visualization as a form of analysis, for example by using lines, shapes, color (!) and beyond that, movement (video) in a way that suits the kind of data we are dealing with.
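To make the point about sequence concrete: where a word cloud only says how often a word occurs, a dispersion profile also says where. The following is a rough sketch, not a finished tool. The function names `dispersion` and `sparkline`, the sample text and the choice of four segments are all made up for illustration.

```python
import re


def dispersion(text, word, segments=10):
    """Count occurrences of `word` (lowercase) in each of `segments` equal slices."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = [0] * segments
    for position, token in enumerate(tokens):
        if token == word:
            # Map the token's position to a segment index from 0 to segments - 1.
            counts[position * segments // len(tokens)] += 1
    return counts


def sparkline(counts):
    """Draw one character per segment; denser marks mean more hits."""
    marks = " .:*#"
    return "".join(marks[min(c, len(marks) - 1)] for c in counts)


# Made-up sample text: "storm" clusters at the start and returns later on.
sample = "storm storm calm calm calm calm calm calm storm calm calm peace"
profile = dispersion(sample, "storm", segments=4)
print(profile)                         # [2, 0, 1, 0]
print("[" + sparkline(profile) + "]")  # [: . ]
```

The list keeps the order of the text, so it can be plotted from beginning to end as a strip or a line, which is exactly the information a word cloud throws away.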

What tools can we use to do visualization? I’m very interested in Processing and have played with it, and I have worked more extensively with R and NLTK/Python. Tools for rendering data, such as Google Chart Tools, igraph and RGraph, are also interesting. Other, non-statistical tools are also an option: freehand drawing tools and web-based services like Many Eyes. Visualization doesn’t need to be restricted to computation/statistics. Stefanie Posavec’s trees are a dynamic mix of automation and manual annotation and demonstrate that visualizations are rhetorically powerful interpretations themselves.

I hope that some of the above-mentioned things connect to other THATcampers’ ideas, e.g. Lincoln Mullen’s post on mining scarce sources and Bill Ferster’s post on teaching using visualization.

Don’t get me started on the potential for teaching. Ultimately, translating a text into another form is a unique kind of critical engagement: you’re uncovering, interpreting and making an argument all at once, both to the text in question and to yourself.

Anyway — anything from discussing theoretical issues of visualization to sharing code snippets would fit into this session and I’m looking forward to hearing other campers’ thoughts and experiences on the subject.


Thanks to Lambert for pointing out this highly recommended piece by danah boyd to me. I like it so much that I’ve decided to assemble some favorite quotes.

On interpreting (big) quantitative social science data:

“Just because you see traces of data doesn’t mean you always know the intention or cultural logic behind them. And just because you have a big N doesn’t mean that it’s representative or generalizable.”

“Many computational scientists believe that because they have large N data that they know more about people’s practices than any other social scientist. Time and time again, I see computational scientists mistake behavioral traces for cultural logic.”

“Big Data is going to be extremely important but we can never lose track of the context in which this data is produced and the cultural logic behind its production.”

On interdisciplinarity and methods:

“Each methodology has its strength and weaknesses. Each approach to data has its strengths and weaknesses. Each theoretical apparatus has its place in scholarship. And one of the biggest challenges in doing “interdisciplinary” work is being about to account for these differences, to know what approach works best for what question, to know what theories speak to what data and can be used in which ways.”

Which is why working in interdisciplinary teams where people really listen to each other is so important. Which is why learning beyond grad school is so important.

On funding agencies and interdisciplinarity:

“I actually think that the funding agencies are going to play a huge role in this, not just in demanding cross-disciplinary collaboration, but in setting the stage for how research will be published.”

This is an important point, and one where I wonder whether the situation over here in Germany isn’t more difficult than in the U.S. Funding agencies over here are incredibly reluctant to make demands of researchers. This has both upsides and downsides, a downside being that there are fewer incentives to cooperate.

On social scientists and computational scientists joining forces to approach Big Data:

“[..]every discipline has its arrogance and far too many scholars think that they know everything. We desperately need a little humility here.”

Amen. And, interestingly enough, I sense a connection between danah’s argument and Frank Schirrmacher’s views:

“Computer scientists must be brought out of their niches and into the middle of society. They must explain the scripts according to which we act and are evaluated. What is predictive search and what can it do? What is ‘profiling’? Who is reading us while we read? Technologies are neutral; what matters is how we use them. To be able to do that, we need interpreters from the technological intelligentsia.”

Of the two, danah is the more critical. Schirrmacher (who isn’t talking about Big Data, but about digital technology in general and its impact on society) demands that computational scientists explain their code to the public: what ranking algorithms do and how context-sensitive ads work. danah criticizes drawing conclusions from automated computational analysis without taking other methods into account. If we start out with simplistic assumptions (e.g. “the people we spend the most time with are the ones closest to us”), we are prone to drawing entirely wrong conclusions, even if our data is beautifully modeled.

I could go on and on about why danah is spot-on, but instead I’ll just point to the piece itself again.
