NLTK corpus functions

On July 11, 2009, in Code, by cornelius

fileids() The files of the corpus
fileids([categories]) The files of the corpus corresponding to these categories
categories() The categories of the corpus
categories([fileids]) The categories of the corpus corresponding to these files
raw() The raw content of the corpus
raw(fileids=[f1,f2,f3]) The raw content of the specified files
raw(categories=[c1,c2]) The raw content of the specified categories
words() The words of the whole corpus
words(fileids=[f1,f2,f3]) The words of the specified fileids
words(categories=[c1,c2]) The words of the specified categories
sents() The sentences of the whole corpus
sents(fileids=[f1,f2,f3]) The sentences of the specified fileids
sents(categories=[c1,c2]) The sentences of the specified categories
abspath(fileid) The location of the given file on disk
encoding(fileid) The encoding of the file (if known)
open(fileid) Open a stream for reading the given corpus file
root() The path to the root of the locally installed corpus
readme() The contents of the README file of the corpus
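
A minimal sketch of a few of these calls against the Brown corpus (which is organized into categories), assuming the corpus has already been downloaded via nltk.download(); ca01 is one of its files:

import nltk
from nltk.corpus import brown

brown.fileids()[:5]                    # the first few files of the corpus
brown.categories()                     # 'news', 'romance', 'religion', ...
brown.words(categories='news')[:10]    # words restricted to one category
brown.sents(fileids='ca01')[0]         # first sentence of one file
brown.raw('ca01')[:200]                # raw (tagged) text of one file
brown.abspath('ca01')                  # location of that file on disk
brown.readme()                         # the corpus README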


NLTK corpora

On July 11, 2009, in Things I want to look up later, by cornelius

[*] alpino………….. Alpino Dutch Treebank
[*] nombank.1.0……… NomBank Corpus 1.0
[*] abc…………….. Australian Broadcasting Commission 2006
[*] maxent_ne_chunker… ACE Named Entity Chunker (Maximum entropy)
[*] conll2000……….. CONLL 2000 Chunking Corpus
[*] chat80………….. Chat-80 Data Files
[*] brown…………… Brown Corpus
[*] brown_tei……….. Brown Corpus (TEI XML Version)
[*] cmudict…………. The Carnegie Mellon Pronouncing Dictionary (0.6)
[*] biocreative_ppi….. BioCreAtIvE (Critical Assessment of Information
Extraction Systems in Biology)
[*] cess_cat………… CESS-CAT Treebank
[*] conll2002……….. CONLL 2002 Named Entity Recognition Corpus
[*] conll2007……….. Dependency Treebanks from CoNLL 2007 (Catalan
and Basque Subset)
[*] city_database……. City Database
[*] indian………….. Indian Language POS-Tagged Corpus
[*] shakespeare……… Shakespeare XML Corpus Sample
[*] dependency_treebank. Dependency Parsed Treebank
[*] inaugural……….. C-Span Inaugural Address Corpus
[*] ieer……………. NIST IE-ER DATA SAMPLE
[*] gutenberg……….. Project Gutenberg Selections
[*] gazetteers………. Gazeteer Lists
[*] names…………… Names Corpus, Version 1.3 (1994-03-29)
[*] mac_morpho………. MAC-MORPHO: Brazilian Portuguese news text with
part-of-speech tags
[*] movie_reviews……. Sentiment Polarity Dataset Version 2.0
[*] cess_esp………… CESS-ESP Treebank
[*] genesis…………. Genesis Corpus
[*] kimmo…………… PC-KIMMO Data Files
[*] floresta………… Portuguese Treebank
[*] qc……………… Experimental Data for Question Classification
[*] nps_chat………… NPS Chat
[*] paradigms……….. Paradigm Corpus
[*] pil…………….. The Patient Information Leaflet (PIL) Corpus
[*] stopwords……….. Stopwords Corpus
[*] propbank………… Proposition Bank Corpus 1.0
[ ] pe08……………. Cross-Framework and Cross-Domain Parser
Evaluation Shared Task
[*] state_union……… C-Span State of the Union Address Corpus
[*] sinica_treebank….. Sinica Treebank Corpus Sample
[*] ppattach………… Prepositional Phrase Attachment Corpus
[*] senseval………… SENSEVAL 2 Corpus: Sense Tagged Text
[*] problem_reports….. Problem Report Corpus
[*] reuters…………. The Reuters-21578 benchmark corpus, ApteMod
version
[*] swadesh…………. Swadesh Wordlists
[*] rte…………….. PASCAL RTE Challenges 1, 2, and 3
[*] udhr……………. Universal Declaration of Human Rights Corpus
[*] treebank………… Penn Treebank Sample
[*] unicode_samples….. Unicode Samples
[*] verbnet…………. VerbNet Lexicon, Version 2.1
[*] wordnet_ic………. WordNet-InfoContent
[*] book_grammars……. Grammars from NLTK Book
[*] words…………… Word Lists
[*] punkt…………… Punkt Tokenizer Models
[*] wordnet…………. WordNet
[*] large_grammars…… Large context-free grammars for parser
comparison
[*] ycoe……………. York-Toronto-Helsinki Parsed Corpus of Old
English Prose
[*] spanish_grammars…. Grammars for Spanish
[*] rslp……………. RSLP Stemmer (Removedor de Sufixos da Lingua
Portuguesa)
[*] tagsets…………. Help on Tagsets
[*] sample_grammars….. Sample Grammars
[*] timit…………… TIMIT Corpus Sample
[*] maxent_treebank_pos_tagger Treebank Part of Speech Tagger (Maximum entropy)
[*] toolbox…………. Toolbox Sample Files
[*] basque_grammars….. Grammars for Basque
[*] hmm_treebank_pos_tagger Treebank Part of Speech Tagger (HMM)
[*] webtext…………. Web Text Corpus
[*] switchboard……… Switchboard Corpus Sample


Accessing corpora: nltk.corpus
String processing: nltk.tokenize, nltk.stem
Collocation discovery: nltk.collocations
Part-of-speech tagging: nltk.tag
Classification: nltk.classify, nltk.cluster
Chunking: nltk.chunk
Parsing: nltk.parse
Semantic interpretation: nltk.sem, nltk.inference
Evaluation metrics: nltk.metrics
Probability and estimation: nltk.probability
Applications: nltk.app, nltk.chat
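
A tiny sketch of how a few of these modules chain together (tokenize, then tag, then chunk); the sentence is just an example input, and the tagger and chunker models listed above need to be installed via nltk.download():

import nltk

sentence = "NLTK was created at the University of Pennsylvania."   # example input
tokens = nltk.word_tokenize(sentence)   # nltk.tokenize
tagged = nltk.pos_tag(tokens)           # nltk.tag (maxent treebank tagger)
tree = nltk.ne_chunk(tagged)            # nltk.chunk (named entity chunker)
tree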


NLTK: add all files in a directory to a corpus

On June 19, 2009, in Code, by cornelius

# NLTK code for building a corpus of Twitter messages (or any number of text files in a dir)

import glob, os
import nltk

# path to the directory that holds the .txt files
path = r'E:\Corpora\Twitter\plaintext\mostfrequent'

raw = ''
for infile in glob.glob(os.path.join(path, '*.txt')):
    f = open(infile)
    raw = raw + ' ' + f.read()
    f.close()

tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
len(text)
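
An alternative sketch that skips the manual loop and lets NLTK's PlaintextCorpusReader treat the directory itself as a corpus (the path and file pattern below are just the example values from above):

from nltk.corpus import PlaintextCorpusReader
corpus_root = r'E:\Corpora\Twitter\plaintext\mostfrequent'
twitter = PlaintextCorpusReader(corpus_root, r'.*\.txt')
twitter.fileids()          # every .txt file found in the directory
len(twitter.words())       # number of tokens across the whole corpus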


NLTK: e-philology cheat sheet

On June 9, 2009, in Code, by cornelius

# PREP

# load nltk
import nltk

# download stuff
nltk.download()

# load submodules. the list of submodules is here. it's usually simplest to just import everything.
from nltk import *
from nltk.book import *
from nltk.corpus import gutenberg
from nltk.corpus import *
# etc

# look at files inside a corpus e.g. gutenberg collection. list of corpora is here
nltk.corpus.gutenberg.fileids()
# or just gutenberg.fileids()

# get an individual text from a corpus. if you want to put the whole corpus into one variable, use corpus.words()
alice_words = nltk.corpus.gutenberg.words('carroll-alice.txt')
alice = nltk.Text(alice_words)
# or just alice = nltk.Text(nltk.corpus.gutenberg.words('carroll-alice.txt'))

# import something from a local file
f = open('document.txt')
raw = f.read()

# import something from a webpage
from urllib import urlopen
url = "http://www.gutenberg.org/files/2554/2554.txt"
raw = urlopen(url).read()
# comment out the line below if the source is already plain text
raw = nltk.clean_html(raw)

# tokenize and get ready for use
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
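
# if sentence boundaries are wanted as well as word tokens, the Punkt models from
# nltk.download() can be used through nltk.sent_tokenize. a sketch, assuming raw
# already holds the text:
sents = nltk.sent_tokenize(raw)                 # needs the 'punkt' model
sents[0]                                        # first sentence as a string
[nltk.word_tokenize(s) for s in sents[:2]]      # first two sentences as token lists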

# PROCEDURES

# tokens in a text
len(text)

# types in a text
len(set(text))

# count occurrences of a word
text.count("word")

# concordance
text.concordance("someword")

# similarity
text.similar("monstrous")

# common contexts
text.common_contexts(["someword", "otherword"])

# lexical dispersion
text.dispersion_plot(["someword", "otherword", "notherword"])

# type-token ratio (types divided by tokens; float() avoids integer division in Python 2)
len(set(text)) / float(len(text))

# compute a list of the most frequent words in the corpus
fdist = FreqDist(text)
vocabulary = fdist.keys()
vocabulary[:50]
# …and generate a cumulative frequency plot for those words
fdist.plot(50, cumulative=True)
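
# the top of that list is usually dominated by function words. a hedged variation
# that filters them out with the Stopwords Corpus before counting:
from nltk.corpus import stopwords
english_stops = set(stopwords.words('english'))
content = [w.lower() for w in text if w.isalpha() and w.lower() not in english_stops]
fdist_content = FreqDist(content)
fdist_content.keys()[:50]    # most frequent content words (keys are frequency-sorted in NLTK 2.x)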
