Jun 25

Here’s an updated version of my script from last month, something I’ve been meaning to do for a while. I thank Anatol Stefanowitsch and Gábor Csárdi for improving my quite sloppy code.


# Load twitteR and igraph packages.
library(twitteR)
library(igraph)


# Start a Twitter session.
sess <- initSession('USERNAME', 'PASSWORD')


# Retrieve a maximum of 20 friends/followers for yourself or someone else Note that
# at the moment, the limit parameter does not [yet] seem to be working.
friends.object <- userFriends('USERNAME', n=20, sess)
followers.object <- userFollowers('USERNAME', n=20, sess)


# Retrieve the names of your friends and followers from the friend
# and follower objects.
friends <- sapply(friends.object,name)
followers <- sapply(followers.object,name)


# Create a data frame that relates friends and followers to you for expression in the graph
relations <- merge(data.frame(User='YOUR_NAME', Follower=friends), data.frame(User=followers, Follower='YOUR_NAME'), all=T)


# Create graph from relations.
g <- graph.data.frame(relations, directed = T)


# Assign labels to the graph (=people's names)
V(g)$label <- V(g)$name


# Plot the graph using plot() or tkplot().
tkplot(g)

Tagged with:
May 24

Edit: I’ve posted an updated version of the script here. It is not quite as compressed as Anatol’s version, but I think it’s a decent compromise between readability and efficiency. :-)

I hacked together some code for R last night to visualize a Twitter graph (=who you are following and who is following you) that I briefly showed at the session on visualizing text today at THATCamp and that I wanted to share. My comments in the code are very basic and there is much to improve, but in the spirit of “release early, release often”, I think it’s better to get it out there right away.

Ingredients:

Note that packages are most easily installed with the install.packages() function inside of R, so R is really the only thing you need to download initially.

Code:

# Load twitteR package
library(twitteR)

# Load igraph package
library(igraph)


# Set up friends and followers as vectors. This, along with some stuff below, is not really necessary, but the result of my relative inability to deal with the twitter user object in an elegant way. I'm hopeful that I will figure out a way of shortening this in the future

friends <- as.character()
followers <- as.character()

# Start an Twitter session. Note that the user through whom the session is started doesn't have to be the one that your search for in the next step. I'm using myself (coffee001) in the code below, but you could authenticate with your username and then search for somebody else.

sess <- initSession('coffee001', 'mypassword')

# Retrieve a maximum of 500 friends for user 'coffee001'.

friends.object <- userFriends('coffee001', n=500, sess)

# Retrieve a maximum of 500 followers for 'coffee001'. Note that retrieving many/all of your followers will create a very busy graph, so if you are experimenting it's better to start with a small number of people (I used 25 for the graph below).

followers.object <- userFollowers('coffee001', n=500, sess)

# This code is necessary at the moment, but only because I don't know how to slice just the "name" field for friends and followers from the list of user objects that twitteR retrieves. I am 100% sure there is an alternative to looping over the objects, I just haven't found it yet. Let me know if you do...

for (i in 1:length(friends.object))
{
friends <- c(friends, friends.object[[i]]@name);
}


for (i in 1:length(followers.object))
{
followers <- c(followers, followers.object[[i]]@name);
}


# Create data frames that relate friends and followers to the user you search for and merge them.

relations.1 <- data.frame(User='Cornelius', Follower=friends)
relations.2 <- data.frame(User=followers, Follower='Cornelius')
relations <- merge(relations.1, relations.2, all=T)

# Create graph from relations.

g <- graph.data.frame(relations, directed = T)

# Assign labels to the graph (=people's names)

V(g)$label <- V(g)$name

# Plot the graph.

plot(g)

For the screenshot below I've used the tkplot() method instead of plot(), which allows you to move around and highlight elements interactively with the mouse after plotting them. The graph only shows 20 people in order to keep the complexity manageable.

Tagged with:
Jul 28

R Lesson 2


text<-c("This is a first example sentence.", "And this is a second example sentence.")

# gsub replaces stuff in strings

> gsub ("second", "third", text)
SEARCH-REPLACE-SUBJECT
[1] "This is a first example sentence."
[2] "And this is a third example sentence."
> gsub ("n", "X", text)
[1] "This is a first example seXteXce."
[2] "AXd this is a secoXd example seXteXce."
> gsub ("is", "was", text)
[1] "Thwas was a first example sentence."
[2] "And thwas was a second example sentence."

---

Perl-style regex

^ beginning of str, e.g. "^x", ***OR*** NOT inside of []
$ end of str, e.g. "x$"
. any other char
\ escape char - TWO ("\\") needed
[] character classes, e.g. [aeiou] vowels, [a-h] is same as [abcdefgh]
{MIN,MAX} number of immediately preceding unit (chacter)

examples
lo+l

> grep("analy[sz]e", c("analyze", "analyse", "moo"), perl=T, value=T)
[1] "analyze" "analyse"

> grep("(first|second)", text, perl=T, value=T)
[1] "This is a first example sentence."
[2] "And this is a second example sentence."
> grep("(first|lalala)", text, perl=T, value=T)
[1] "This is a first example sentence."
>

> grep("ab{2}", z, perl=T, value=T)
[1] "aabbccdd"
> grep("(ab){2}", z, perl=T, value=T)
[1] "ababcdcd"
>
>
> gsub("a (first|second)", "another", text, perl=T)
[1] "This is another example sentence."
[2] "And this is another example sentence."
>
>
>
>
> gsub("[abcdefgh]", "X", text, perl=T)
[1] "TXis is X Xirst XxXmplX sXntXnXX."
[2] "AnX tXis is X sXXonX XxXmplX sXntXnXX."

> grep("forg[eo]t(s|ting|ten)?_v", a.corpus.file, perl=T, value=T)
all forms of forget

*? lazy matching e.g.
gregexpr("s.*?s", text[1], perl=T)

> gregexpr("s.*?s", text[1], perl=T)
[[1]]
[1] 4 14
attr(,"match.length")
[1] 4 12

# note: things that are matched are consumed and can then not be found again in the same passtext

> gsub("(19|20)[0-9]{2}", "YEAR", text)
[1] "They killed 250 people in YEAR." "No, it was in YEAR."
> #replaces only 19xx and 20xx

---

> textfile<-scan(file.choose(), what="char", sep="\n")
Enter file name: corp_gpl_short.txt
Read 9 items
> textfile<-tolower(textfile)
> textfile
[1] "the licenses for most software are designed to take away your"
[2] "freedom to share and change it. by contrast, the gnu general public"
[3] "license is intended to guarantee your freedom to share and change free"
[4] "software--to make sure the software is free for all its users. this"
[5] "general public license applies to most of the free software"
[6] "foundation's software and to any other program whose authors commit to"
[7] "using it. (some other free software foundation software is covered by"
[8] "the gnu library general public license instead.) you can apply it to"
[9] "your programs, too."
> unlist(strsplit(textfile, "//W"))
[1] "the licenses for most software are designed to take away your"
[2] "freedom to share and change it. by contrast, the gnu general public"
[3] "license is intended to guarantee your freedom to share and change free"
[4] "software--to make sure the software is free for all its users. this"
[5] "general public license applies to most of the free software"
[6] "foundation's software and to any other program whose authors commit to"
[7] "using it. (some other free software foundation software is covered by"
[8] "the gnu library general public license instead.) you can apply it to"
[9] "your programs, too."
> text_split<-unlist(strsplit(textfile, "//W"))
> text_split
[1] "the licenses for most software are designed to take away your"
[2] "freedom to share and change it. by contrast, the gnu general public"
[3] "license is intended to guarantee your freedom to share and change free"
[4] "software--to make sure the software is free for all its users. this"
[5] "general public license applies to most of the free software"
[6] "foundation's software and to any other program whose authors commit to"
[7] "using it. (some other free software foundation software is covered by"
[8] "the gnu library general public license instead.) you can apply it to"
[9] "your programs, too."
>
> text_split<-unlist(strsplit(textfile, "//W"))
> text_split
[1] "the licenses for most software are designed to take away your"
[2] "freedom to share and change it. by contrast, the gnu general public"
[3] "license is intended to guarantee your freedom to share and change free"
[4] "software--to make sure the software is free for all its users. this"
[5] "general public license applies to most of the free software"
[6] "foundation's software and to any other program whose authors commit to"
[7] "using it. (some other free software foundation software is covered by"
[8] "the gnu library general public license instead.) you can apply it to"
[9] "your programs, too."
> text_split<-unlist(strsplit(textfile, "\\W"))

> textfile<-scan(file.choose(), what="char", sep="\n")
Enter file name: corp_gpl_short.txt
Read 9 items
> textfile<-tolower(textfile)
> textfile
[1] "the licenses for most software are designed to take away your"
[2] "freedom to share and change it. by contrast, the gnu general public"
[3] "license is intended to guarantee your freedom to share and change free"
[4] "software--to make sure the software is free for all its users. this"
[5] "general public license applies to most of the free software"
[6] "foundation's software and to any other program whose authors commit to"
[7] "using it. (some other free software foundation software is covered by"
[8] "the gnu library general public license instead.) you can apply it to"
[9] "your programs, too."
> unlist(strsplit(textfile, "//W"))
[1] "the licenses for most software are designed to take away your"
[2] "freedom to share and change it. by contrast, the gnu general public"
[3] "license is intended to guarantee your freedom to share and change free"
[4] "software--to make sure the software is free for all its users. this"
[5] "general public license applies to most of the free software"
[6] "foundation's software and to any other program whose authors commit to"
[7] "using it. (some other free software foundation software is covered by"
[8] "the gnu library general public license instead.) you can apply it to"
[9] "your programs, too."

> text_split<-unlist(strsplit(textfile, "//W+"))
> text_split
[1] "the licenses for most software are designed to take away your"
[2] "freedom to share and change it. by contrast, the gnu general public"
[3] "license is intended to guarantee your freedom to share and change free"
[4] "software--to make sure the software is free for all its users. this"
[5] "general public license applies to most of the free software"
[6] "foundation's software and to any other program whose authors commit to"
[7] "using it. (some other free software foundation software is covered by"
[8] "the gnu library general public license instead.) you can apply it to"
[9] "your programs, too."
> sort(table(text_split), decreasing=T)
text_split
to software the free and general
9 9 7 5 4 3 3
is it license public your by change
3 3 3 3 3 2 2
for foundation freedom gnu most other share
2 2 2 2 2 2 2
all any applies apply are authors away
1 1 1 1 1 1 1
can commit contrast covered designed guarantee instead
1 1 1 1 1 1 1
intended its library licenses make of program
1 1 1 1 1 1 1
programs s some sure take this too
1 1 1 1 1 1 1
users using whose you
1 1 1 1
>

> text_freqs
text_split
to software the free and general is
9 7 5 4 3 3 3
it license public your by change for
3 3 3 3 2 2 2
foundation freedom gnu most other share all
2 2 2 2 2 2 1
any applies apply are authors away can
1 1 1 1 1 1 1
commit contrast covered designed guarantee instead intended
1 1 1 1 1 1 1
its library licenses make of program programs
1 1 1 1 1 1 1
s some sure take this too users
1 1 1 1 1 1 1
using whose you
1 1 1
> text_freqs[text_freqs>1]
text_split
to software the free and general is
9 7 5 4 3 3 3
it license public your by change for
3 3 3 3 2 2 2
foundation freedom gnu most other share
2 2 2 2 2 2
>

> !(text_split %in% stop_list)
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[13] TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
[25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
[37] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[49] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
[61] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[73] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
[85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> text_stopremoved<-text_split[!(text_split %in% stop_list)]
> text_stopremoved
[1] "licenses" "for" "most" "software" "are"
[6] "designed" "to" "take" "away" "your"
[11] "freedom" "to" "share" "change" "it"
[16] "by" "contrast" "gnu" "general" "public"
[21] "license" "is" "intended" "to" "guarantee"
[26] "your" "freedom" "to" "share" "change"
[31] "free" "software" "to" "make" "sure"
[36] "software" "is" "free" "for" "all"
[41] "its" "users" "this" "general" "public"
[46] "license" "applies" "to" "most" "free"
[51] "software" "foundation" "s" "software" "to"
[56] "any" "other" "program" "whose" "authors"
[61] "commit" "to" "using" "it" "some"
[66] "other" "free" "software" "foundation" "software"
[71] "is" "covered" "by" "gnu" "library"
[76] "general" "public" "license" "instead" "you"
[81] "can" "apply" "it" "to" "your"
[86] "programs" "too"
>

# LOAD an R file
source("something.r")

Tagged with:
Jul 28

(This post documents the first day of a class on R that I took at ESU C&T. I is posted here purely for my own use.)


R Lesson 1

> 2+3; 2/3; 2^3
[1] 5
[1] 0.6666667
[1] 8

---

Fundamentals - Functions

> log(x=1000, base=10)
[1] 3

---

(Formals describes the syntax of other functions)

formals(sample)

---

Variables

( <- allows you to save something in a data structure (variable) )
> a<-2+3
> a
[1] 5

# is for comments

whitespace doesn't matter

---
# Pick files
file.choose()

# Get working dir
getwd()

# Set working dir
setwd("..")

# Save
> save(VARIABLE_NAME, file=file.choose())
Fehler in save(test, file = file.choose()) : Objekt ‘test’ nicht gefunden
> save.image("FILE_NAME")

---

> setwd("/home/cornelius/Code/samples/Brown_95perc")
> getwd()
[1] "/home/cornelius/Code/samples/Brown_95perc"
> dir()

> my_array <- c(1,2,3,4)
> my_array
[1] 1 2 3 4
> my_array <- c("lalala", "lululu", "bla")
> my_array2 <- c(1,2,3,4)
> c(my_array, my_array2)
[1] "lalala" "lululu" "bla" "1" "2" "3" "4"
>

# it is possible to add something to ALL values in a vector, i.e.
my_array2 + 10

# c (conc) makes a list
stuff1<-c(1,2,3,4,5)

---

# sequence starts at 1 (first arg), goes on for 5 (second arg), increments by 1 (third arg)
seq(1, 5, 1)

---

# put a file into a corpus vector
# what=real|char sep=seperator
> my_corpus<-scan(file=file.choose(), what="char", sep="\n")

# unique elements in my array
unique(array)

# count elements in an array
table(array)

# sort elements in an array
sort(table(array))

---
# this tells me the position of the elements in my text that aren't "this"
> values<-which(my_little_corpus!="this")
> values
[1] 2 3 4 5 6 7 8 9 11 12 13 14

# this will produce TRUE|FALSE for my condition (is this element "this")
> values<-my_little_corpus!="this"
> values
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
[13] TRUE TRUE

# this will return the array without "this"
> values<-my_little_corpus[my_little_corpus!="this"]
> values
[1] "is" "just" "a" "little" "example" "bla" "bla"
[8] "bla" "is" "the" "third" "line"

...

> cc<-c("banana", "bagel")
> cc == "banana"; cc!="banana" #
[1] TRUE FALSE
[1] FALSE TRUE
> "banana" %in% cc
[1] TRUE
> c("bagel", "banana") %in% cc
[1] TRUE TRUE
> match ("banana", cc)
[1] 1
> match (c("bagel","banana"), cc)
[1] 2 1

# match looks for a list of tokens and returns their position in the datastructure

---
> cat(bb, sep="\n", file=scan(what="char"), append=F)
# write the contents of bb to a file, ask the user for file

moo<-scan(what="char")
# read something the user types into a var

# Clear Mem
> rm(list=ls(all=T))
>

---

# create vector1 (ordered)
vec1<-c("a","b","c","d","e","f,",g","h","i","j")

# oder
# > letters[1:10]
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

# create vector2 (random)
# > vector2<-sample(vector1)

---

length()
# number of elements

nchar()
# number of characters

> aa<-"know"
> nchar(aa)
[1] 4
> aa<-c("I","do","not","know")
> nchar(aa)
[1] 1 2 3 4
> lala<-c("cat","gnu","hippopotamus")
> lala
[1] "cat" "gnu" "hippopotamus"
> nchar(lala)
[1] 3 3 12

> substr("hippopotamus", 0, 5)
[1] "hippo"
>

# like explode() / implode()
paste (string, sep="my_seperator", collapse="stuff to put in")

---

# percentages
x/sum(x)

barplot (1,2,3)

Read in corpus data and build a list of words frequencies
1) scan file
2) strsplit by " "
3) unlist to make vector
4) make a table with freqs
5) sort
6) output

#search for strings
grep("needle", haystack)

> grep("is", text, value=T)
[1] "This is a first example sentence."
[2] "And this is a second example sentence."
> grep("And", text, value=T)
[1] "And this is a second example sentence."
> grep("sentence", text, value=T)
[1] "This is a first example sentence."
[2] "And this is a second example sentence."
>

gregexpr
# alternative to grep, returns a list of vectors

> mat<-gregexpr("e", text)
> mat
[[1]]
[1] 17 23 26 29 32
attr(,"match.length")
[1] 1 1 1 1 1

[[2]]
[1] 16 22 28 31 34 37
attr(,"match.length")
[1] 1 1 1 1 1 1

> unlist(mat)
[1] 17 23 26 29 32 16 22 28 31 34 37
> mat<-gregexpr("sentence", text)
> sapply (mat, c)
[1] 25 30

Tagged with:
Jul 11

fileids() The files of the corpus
fileids([categories]) The files of the corpus corresponding to these categories
categories() The categories of the corpus
categories([fileids]) The categories of the corpus corresponding to these files
raw() The raw content of the corpus
raw(fileids=[f1,f2,f3]) The raw content of the specified files
raw(categories=[c1,c2]) The raw content of the specified categories
words() The words of the whole corpus
words(fileids=[f1,f2,f3]) The words of the specified fileids
words(categories=[c1,c2]) The words of the specified categories
sents() The sentences of the specified categories
sents(fileids=[f1,f2,f3]) The sentences of the specified fileids
sents(categories=[c1,c2]) The sentences of the specified categories
abspath(fileid) The location of the given file on disk
encoding(fileid) The encoding of the file (if known)
open(fileid) Open a stream for reading the given corpus file
root() The path to the root of locally installed corpus
readme() The contents of the README file of the corpus

Tagged with:
Jun 19

# NLTK code for building a corpus of Twitter messages (or any number of text files in a dir)

import glob, os
path = ‘E:\Corpora\Twitter\plaintext\mostfrequent’
for infile in glob.glob (os.path.join(path, ‘*.txt’) ):
f = open(infile)
raw = raw + ‘ ‘ + f.read()

tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
len(text)

Tagged with:
Jun 09

# PREP

# load nltk
import nltk

# download stuff
nltk.download()

# load submodules. list of submodules is here. it’s usually simplest to just import everything.
from nltk import *
from nltk.book import *
from nltk.corpus import gutenberg
from nltk.corpus import *
# etc

# look at files inside a corpus e.g. gutenberg collection. list of corpora is here
nltk.corpus.gutenberg.fileids()
# or just gutenberg.fileids()

# get an individual text from a corpus. if you want to put the whole corpus into one variable, use corpus.words()
alice_words = nltk.corpus.gutenberg.words(‘carroll-alice.txt’)
alice = nltk.Text(alice_words)
# or just alice = nltk.Text(nltk.corpus.gutenberg.words(‘carroll-alice.txt’))

# import something from a local file
f = open(‘document.txt’)
raw = f.read()

# import something from a webpage
from urllib import urlopen
url = “http://www.gutenberg.org/files/2554/2554.txt”
raw = urlopen(url).read()
# we might comment out line below if the source is plain text
raw = nltk.clean_html(html)

# tokenize and get ready for use
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)

# PROCEDURES

# tokens in a text
len(text)

# types in a text
len(set(text))

# count occurance of a word
text.count(“word”)

# concordance
text.concordance(“someword”)

# similarity
text.similar(“monstrous”)

# common contexts
text.common_contexts(["someword", "otherword"])

# lexical dispersion
text.dispersion_plot(["someword", "otherword", "notherword"])

# type-token ratio
len(text) / len(set(text))

# compute a list of the most frequent words in the corpus
fdist = FreqDist(text)
vocabulary = fdist.keys()
vocabulary[:50]
# …and generate a cumulative frequency plot for those words
fdist.plot(50, cumulative=True)

Tagged with:
preload preload preload