I’ve been following the development of googleVis, the R implementation of the Google Visualization API, for a while now. The library has a lot of potential as a bridge between R (where data processing happens) and HTML (where presentation is [increasingly] happening). A growing number of visualization frameworks are on the market and all have their perks (e.g. Many Eyes, Simile, Flare). I was so inspired by the Hans Rosling TED talk, which makes such great use of bubble charts, that I wanted to try the Google Vis API for that chart type alone. There’s more, however, if you don’t care much for floating bubbles: neat chart variants include the geochart, area charts and the usual classics (bar, pie, etc.). Check out the chart gallery for an overview.

So here are my internet growth charts:

(1) Motion chart showing the growth of the global internet population since 2000 for 208 countries

(2) World map showing global internet user statistics for 2009 for 208 countries

Data source: data.un.org (ITU database). I’ve merged two tables from the database into one (absolute numbers and percentages) and cleaned the data up a bit. The resulting tab-separated CSV file is available here.
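For anyone who wants to repeat the merge step, here is a minimal sketch in base R. The two data frames are made-up stand-ins for the UN tables, and the column names Users and Percent are assumptions, not the names used in the actual file.

```r
# Made-up stand-ins for the two UN tables; column names are assumptions
users <- data.frame(Country = c("A", "A", "B"),
                    Year    = c(2008, 2009, 2009),
                    Users   = c(100, 150, 40))   # placeholder values
pct   <- data.frame(Country = c("A", "A", "B"),
                    Year    = c(2008, 2009, 2009),
                    Percent = c(10, 15, 4))      # placeholder values

# Join on the shared keys so each row holds both measures for a country-year
n <- merge(users, pct, by = c("Country", "Year"))

# Write a tab-separated file (to a temporary directory in this sketch)
out <- file.path(tempdir(), "netstats.csv")
write.table(n, out, sep = "\t", row.names = FALSE, quote = FALSE)
```

Joining on both Country and Year keeps one row per country and year, which is the long format a motion chart expects.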

And here’s the R code for rendering the motion chart. For the second chart you basically replace gvisMotionChart() with gvisGeoChart(); the rest is largely the same.

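library(googleVis)  # provides gvisMotionChart() and gvisGeoChart()
n <- read.csv("netstats.csv", sep="\t")  # tab-separated despite the .csv extension
nmotion <- gvisMotionChart(n, idvar="Country", timevar="Year", options=list(width=1024, height=768))
plot(nmotion)  # opens the rendered chart in the default browser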
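
For the world map, a minimal sketch might look like the following. It assumes a recent googleVis release, where gvisGeoChart() takes locationvar and colorvar rather than idvar and timevar. The Percent column name is also an assumption, and the data frame is a made-up stand-in for netstats.csv.

```r
# Made-up stand-in for netstats.csv; "Percent" is an assumed column name
n <- data.frame(
  Country = c("Germany", "Brazil", "Japan", "Germany", "Brazil", "Japan"),
  Year    = c(2008, 2008, 2008, 2009, 2009, 2009),
  Percent = c(10, 20, 30, 15, 25, 35)  # placeholder values, not real statistics
)

# A geo chart shows a single snapshot, so keep only the 2009 rows
n2009 <- n[n$Year == 2009, ]

# Only build the chart if googleVis is installed
if (requireNamespace("googleVis", quietly = TRUE)) {
  ngeo <- googleVis::gvisGeoChart(n2009, locationvar = "Country",
                                  colorvar = "Percent",
                                  options = list(width = 1024, height = 768))
  if (interactive()) plot(ngeo)  # opens the chart in the default browser
}
```

Filtering to one year first matters because, unlike the motion chart, the map has no time axis.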

How relevant is data literacy?

On March 10, 2011, in Thoughts, by cornelius

Two independent trajectories have prompted me to think about data literacy and its relevance lately. I’ll focus specifically on social data in the rest of this post, i.e. the information we generate on Facebook and similar services, though I think there are cases where these ideas may apply to other kinds of data as well.

In late February I attended the Cognitive Cities Conference, an event about the digital future of urbanity. Many presentations at CoCities incorporated statistics and flashy visualizations (traffic patterns, the journey of household trash to a landfill), and the importance of data was a recurring theme. It seemed to me like there was a slight uneasiness among the speakers in the face of the huge projection (which showed a colorful rendition of the presenter’s face at the beginning of each talk) and the ultramodern, Arduino-lit installation on the podium, activated by the speaker’s voice. Awe of such digital embellishments was mixed with embarrassment: Please, I’m not nearly as cool as that thing makes me look, many speakers seemed to say. Their reaction reflected a lingering consciousness of the risks posed by uncritical techno-fetishism that characterized the event for me. The digital future of cities, it became clear in the course of the two-day conference, will be intricately linked to our own future. Will we be smart mobs (or, even better, smart individuals), or dumb blobs of data, waiting to be mined by companies and government bureaucracies? Will we program or be programmed?

Ton Zijlstra speaking at the Cognitive Cities Conference

One commentator aptly pointed out that a visualization of bike travel patterns in New York City didn’t really reveal anything a local wouldn’t know without rendering a graph, but the futurists were undeterred — and believe me when I say that I totally get why. All this data we all generate — whether it means something or not — can be analyzed, mined, visualized and repackaged in sophisticated rhetorical pastiches that blur the boundary between information and art. Data is being used to sell products, frame political statements and make scientific arguments. It is used to get insane valuations from investors, valuations ultimately based on the assumption that in the digital future, human behavior will be predictable in ways previously unimaginable. If code is law, data is capital.

The persuasiveness of digital data is owed to its degree of abstraction. The visualization of a set of data is a Russian doll of abstraction. It’s an interpretation based on implicit assumptions (What is highlighted? What is left out?), and on something (data) that also has a fluid and subjective relation to the world (What are friends on Facebook? Is there any relation between real friends and Facebook friends?). The raison d’être of social data is that something or someone external to us has generated it, making it seemingly superior evidence to our personal intuitions. But the frame in which the behavior that the data purports to describe takes place conditions the possible options. The existence of a relationship status field makes the question of whether 500 million people are single or in a relationship (and whether their relationship is complicated) a public issue. By asking the question you’re conditioning the answer.

Dietmar Offenhuber (MIT) maps immigrant phone call patterns in NYC

A second trajectory is the work on Twitter hashtag datasets we do in Düsseldorf as part of the Junior Researchers Group “Science and the Internet”. We’ve been using graph analysis and other procedures to figure out who is talking to whom and what’s being retweeted. The recent shutdown of TwapperKeeper has forced us to find our own custom solution for archiving tweets. In the process of looking for a fix, I discovered Amazon AWS and experimented with cloud-based data collection. I was up until 4am last night because I was so fascinated by the ability to launch a highly customized virtual server at the click of a button. Geeky as that may be, virtualization really empowers developers. It used to be that you needed access to a physical server for this kind of data collection: perhaps an old machine sitting in your office running 24/7, or, if you were a bit more professional, a machine provided by your university’s computing services. Or you could rent a commercial server, assuming you could afford it. But you couldn’t just click “launch instance”. You had to handle your resources carefully.

Not anymore. Not only is “web space” cheap or free (that happened a few years ago), but virtual computing power has become a commodity that you can use in a flexible way to do whatever you want to get done — collect data, do complex computations, anything. The one barrier that remains between the individual and this kind of digital self-empowerment is data literacy (in the connected world, that is, which means by no means everywhere). It is hard to imagine a future where those who are literate will not have a significant advantage over those who aren’t, because that barrier is unlikely to disappear as rapidly as economic hurdles are.

My take on this is not entirely positive. The increasing semantification of digital information and the ubiquity of data make arguments based on data and communicated via visualizations increasingly popular. Data-based argumentation can be deceitful or built on false premises, just like any other form of rhetoric. Data literacy must therefore not only be concerned with the technical dimension of data usage, but also with a critical reflection on the data’s relationship to the world. Add to this questions of ownership (Whose data is it?), control (Is the data being used to make inferences about people without their knowledge?) and trust (Are you dealing with a reliable data source?) and you have a rough sketch of what data literacy might look like.

Data literacy mind map. What's missing?

Should we start teaching this stuff in school, as for example Adam Greenfield suggests? Or is data literacy a technocrat’s pipe dream, touted in order to make something appear universally relevant that really concerns only a small group of nerds?

Are our visualizations the ghosts from outer space that author Warren Ellis conjured in his closing speech at CoCities, phantasms that pretend to signify something, but ultimately mean nothing? Let me know what you think.


URLs tweeted at #THATCamp (all 230 of them)

On May 24, 2010, in data, by cornelius

I’ve data-mined the #thatcamp hashtag a bit more and extracted all 230 links that were tweeted recently (also includes some of THATCamp Paris). Enjoy :-)

(or go here to view the table inside Google Docs)


One week of Scientwist tweeting (January 18 – 25)

On January 27, 2010, in data, by cornelius

Here’s a list of URLs and hashtags that were popular among the @scientwists community last week. I realize that this is just a long enumeration, but I’m planning to publish these stats in a more concise format in the near future.

January 18th

January 19th

January 20th

January 21st

January 22nd

January 23rd

January 24th

January 25th


Since starting the Scientwists Project a bit over a week ago, I’ve been busy hacking up Bash and R scripts in order to analyze the data produced by the 500+ scholars that I’m following. Here’s a first glimpse of what they’ve been tweeting about, specifically the URLs and hashtags they’ve used.

In total, I’ve collected about 12,000 tweets since January 7th, containing 4,750 different URLs and 1,130 different hashtags.
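
Counts like these come down to pulling URLs and hashtags out of the tweet text and tabulating them. Here is a minimal base R sketch of that idea, not the actual scripts used for this post. The tweets are made up, the regular expressions are simplistic, and lower-casing hashtags before counting is an assumption.

```r
# Made-up tweets standing in for the collected data
tweets <- c(
  "Packing for #scio10, see http://example.org/a",
  "Great session at #scio10 #science",
  "Reading http://example.org/a again #Science"
)

# Pull every match of a pattern out of a character vector
extract_all <- function(text, pattern) {
  unlist(regmatches(text, gregexpr(pattern, text, perl = TRUE)))
}

hashtags <- extract_all(tweets, "#\\w+")
urls     <- extract_all(tweets, "https?://\\S+")

# Tabulate and sort; hashtags are lower-cased so #Science and #science merge
hashtag_counts <- sort(table(tolower(hashtags)), decreasing = TRUE)
url_counts     <- sort(table(urls), decreasing = TRUE)

head(hashtag_counts, 15)   # most frequent hashtags
length(unique(urls))       # number of distinct URLs
```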

10 most popular URLs

1. The Shorty Awards

2. Dennis Meadows: The Oil Drum: Economics and Limits to Growth: What’s Sustainable?

3. Björn Brembs: Social filtering of scientific information – a view beyond Twitter

4. BioData Product Blog: Laboratory Notebooks: A thing of the past?

5. Forbes.com: Illumina’s Cheap New Gene Machine

6. A photograph of clouds that seem to resemble Great Britain :-)

7. Times Online: Baroness Greenfield loses her job in Royal Institution shake-up

8. Mr. Gunn: Cell launches a new format for the presentation of research articles online

9. Daniel Mietchen: On the need for a global academic internet platform [ref to Nadja Kutz: arxiv.org/abs/0803.1360]

10. Rebecca Skloot: The Immortal Life of Henrietta Lacks

These were tweeted between 5 (#9 and #10) and 30 (#1) times. However, tracking URLs is complicated by the fact that many different addresses may point to the same source, especially since people use a variety of different URL shorteners. This is something I’ll resolve later, so for now this is fairly anecdotal.
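
One way to approach the shortener problem is to map every address to a canonical form before counting. The sketch below is only an illustration of that idea. The function count_canonical() and its expand hook are hypothetical names, and the lookup table is made up to stand in for real redirect resolution (which would need an HTTP client that follows redirects).

```r
# Count URLs after mapping each one to a canonical form.
# `expand` is a hypothetical hook: any function that takes one URL and
# returns the address it finally resolves to.
count_canonical <- function(urls, expand = identity) {
  resolved <- vapply(urls, expand, character(1), USE.NAMES = FALSE)
  resolved <- sub("/+$", "", resolved)  # drop trailing slashes
  sort(table(resolved), decreasing = TRUE)
}

# Made-up lookup table standing in for real redirect resolution
lookup <- c("http://sho.rt/a"       = "http://example.org/post/",
            "http://tiny.example/b" = "http://example.org/post")
fake_expand <- function(u) if (u %in% names(lookup)) lookup[[u]] else u

tweeted <- c("http://sho.rt/a", "http://tiny.example/b",
             "http://example.org/post", "http://example.org/other")
counts <- count_canonical(tweeted, expand = fake_expand)
```

With the made-up lookup, the two shortened links and the direct link all collapse into a single entry, so the ranking reflects sources rather than addresses.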

15 most popular hashtags

1. #scio10 (391x)
2. #scidebate (84x)
3. #fb (75x)
4. #science (68x)
5. #technology (67x)
6. #tcot (58x)
7. #orca (54x)
8. #debateanatel (53x)
9. #Glee (31x)
10. #ff (27x)
11. #HeLa (26x)
12. #uksnow (26x)
13. #Haiti (25x)
14. #NetDE (24x)
15. #gov20 (21x)

Obviously some of these are automatically generated (#fb and #ff), but there’s a fair share of interesting ones. I’m expecting #scio10 will dominate the next few days even more visibly.

Hope it’s informative – let me know if you have any questions. :-)