Being here in Cologne at the moment for the 5th International Conference on e-Social Science makes me realize how solitary finishing my dissertation last winter really was by comparison. Not that writing and thinking in solitude for a sustained period of time is a bad thing, but it’s still great to connect to others doing similar research and to test your ideas in a public forum on a regular basis.
My presentation, somewhat akin to the one I gave in spring at the VKS, was concerned with aspects of digital (scholarly) communication on the Net and was quite “conceptual”. In other words, I did not present the results of finished research or a systematic proposal, but instead applied more general ideas from linguistics and research into Web 2.0 to scholarship and scholarly publishing practices. I thought the response was quite positive and the input will be helpful for my proposed research – a larger study of digital scholarly communication in several humanities and social sciences disciplines.
Below are the slides.
Thank you to Julian Newman and Esther Breuer for organizing the session and to the other presenters and attendees for a thought-provoking discussion. Ping at Nick Jankowski, Kirsten Schindler, Michael König, Janelle Ward and Kathryn Eccles with whom I had a wonderful chat about history, linguistics, academia and a plethora of other topics over post-workshop coffee.
Digital Humanities 2009 is in full swing at the moment and I’ve regretted more than once that I wasn’t able to fit that event into my schedule this year. But there’s good news: DH 2010 will take place much closer to home, at King’s College in London, and I won’t miss the chance to be a part of this exciting event the next time around.
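# NLTK code for building a corpus of Twitter messages (or any number of text files in a dir)
import glob, os
import nltk
# raw string, so that backslashes such as \t in \Twitter are not read as escape sequences
path = r'E:\Corpora\Twitter\plaintext\mostfrequent'
raw = ''
for infile in glob.glob(os.path.join(path, '*.txt')):
    # append the contents of each file to one long string
    with open(infile) as f:
        raw = raw + ' ' + f.read()
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)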
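For trying out the file-gathering step without NLTK or a real Twitter corpus at hand, a minimal sketch could look like the one below. The helper name `read_corpus`, the two sample files and their contents are made up for illustration, and a plain whitespace split stands in for `nltk.word_tokenize`.

```python
import glob
import os
import tempfile


def read_corpus(directory, pattern="*.txt"):
    """Concatenate every file matching `pattern` in `directory` into one string."""
    parts = []
    # sorted() keeps the file order stable across platforms
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        with open(path, encoding="utf-8") as handle:
            parts.append(handle.read())
    return " ".join(parts)


# Build a tiny throwaway corpus (hypothetical content) so the sketch runs anywhere.
samples = [("a.txt", "just landed in Cologne"), ("b.txt", "great session today")]
with tempfile.TemporaryDirectory() as demo_dir:
    for name, content in samples:
        with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as handle:
            handle.write(content)
    raw = read_corpus(demo_dir)

# Whitespace split is an assumption here; it is cruder than nltk.word_tokenize.
tokens = raw.split()
print(tokens)
```

Collecting the pieces in a list and joining once avoids rebuilding the string on every file, which starts to matter with thousands of small files.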
All products of information technology —paintings and poems, novels and newspapers, movies and music — have been static since our ancestors first scratched diagrams in the dirt or pressed visions of their world on the walls of caves. Other human hands could add or destroy, but the products of our hands could do nothing but decay, prey to the scorching sun, the worm, or the slow fires of acid within. We can direct our questions to the written word or to the most lifelike painting, but we can expect only silence. Now, however, we have created cultural products that can respond, systems that can change and adapt themselves to our needs.
Writing, Phaedrus, has this strange quality, and is very like painting; for the creatures of painting stand like living beings, but if one asks them a question, they preserve a solemn silence. And so it is with written words; you might think they spoke as if they had intelligence, but if you question them, wishing to know about their sayings, they always say only one and the same thing.
But when they came to letters, This, said Theuth, will make the Egyptians wiser and give them better memories; it is a specific both for the memory and for the wit. Thamus replied: O most ingenious Theuth, the parent or inventor of an art is not always the best judge of the utility or inutility of his own inventions to the users of them. And in this instance, you who are the father of letters, from a paternal love of your own children have been led to attribute to them a quality which they cannot have; for this discovery of yours will create forgetfulness in the learners’ souls, because they will not use their memories; they will trust to the external written characters and not remember of themselves. The specific which you have discovered is an aid not to memory, but to reminiscence, and you give your disciples not truth, but only the semblance of truth; they will be hearers of many things and will have learned nothing; they will appear to be omniscient and will generally know nothing; they will be tiresome company, having the show of wisdom without the reality.
[W]ho should leave in writing or receive in writing any art under the idea that the written word would be intelligible or certain; or who deemed that writing was at all better than knowledge and recollection of the same matters?
# load nltk
import nltk
# download stuff (opens the NLTK downloader for corpora and models)
nltk.download()
# load submodules. it’s usually simplest to just import everything.
from nltk import *
from nltk.book import *
from nltk.corpus import gutenberg
from nltk.corpus import *
# look at files inside a corpus, e.g. the gutenberg collection
nltk.corpus.gutenberg.fileids()
# or just gutenberg.fileids()
# get an individual text from a corpus. if you want to put the whole corpus into one variable, use corpus.words()
alice_words = nltk.corpus.gutenberg.words('carroll-alice.txt')
alice = nltk.Text(alice_words)
# or just alice = nltk.Text(nltk.corpus.gutenberg.words('carroll-alice.txt'))
# import something from a local file
with open('document.txt') as f:
    raw = f.read()
# import something from a webpage
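from urllib.request import urlopen  # Python 3; in Python 2 this was "from urllib import urlopen"
url = "http://www.gutenberg.org/files/2554/2554.txt"
raw = urlopen(url).read().decode('utf8')
# the source above is plain text, so no markup needs stripping. for an HTML page,
# nltk.clean_html() is no longer implemented in NLTK 3, which points to BeautifulSoup instead:
# from bs4 import BeautifulSoup
# raw = BeautifulSoup(raw, 'html.parser').get_text()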
# tokenize and get ready for use
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
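# tokens in a text
len(text)
# types in a text
len(set(text))
# count occurrences of a word
text.count("someword")
# common contexts shared by two or more words
text.common_contexts(["someword", "otherword"])
# lexical dispersion
text.dispersion_plot(["someword", "otherword", "notherword"])
# type-token ratio (types divided by tokens)
len(set(text)) / len(text)
# compute a list of the most frequent words in the corpus
fdist = FreqDist(text)
vocabulary = [word for word, count in fdist.most_common()]
# …and generate a cumulative frequency plot for those words (50 is an arbitrary cutoff)
fdist.plot(50, cumulative=True)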
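As a rough, self-contained illustration of the last two measures (type-token ratio and cumulative word frequencies), here is a sketch in plain Python that does not need NLTK installed. It swaps in `collections.Counter` for `FreqDist` and a whitespace split for `word_tokenize`; the sample sentence and the helper name `type_token_ratio` are made up for the example.

```python
from collections import Counter
from itertools import accumulate


def type_token_ratio(tokens):
    """Distinct word forms (types) divided by running words (tokens)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical sample text; a whitespace split stands in for a real tokenizer.
tokens = "the cat sat on the mat and the dog sat too".split()

# Counter.most_common() returns (word, count) pairs, most frequent first.
ranked = Counter(tokens).most_common()

# Running totals of the counts: the numbers a cumulative frequency plot would show.
cumulative = list(accumulate(count for _, count in ranked))

total = len(tokens)
for (word, count), running in zip(ranked, cumulative):
    print(f"{word:>4} {count:2d} {running / total:.2f}")

print("type-token ratio:", round(type_token_ratio(tokens), 2))
```

The last cumulative value always equals the number of tokens, which makes for a quick sanity check before plotting anything.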