I meant to post this a month or so ago, when I was conducting my study of casual tweeting, but didn’t get to it. No harm in posting it now, I guess — code doesn’t go bad, fortunately.
Note: this requires Linux/Unix/OSX, Python 2.6 and the tweepy library. It might also work on Windows, but I haven’t checked.
1. Fetching a single user’s tweets with twitter_fetch.py
The purpose of the script below is to automatically retrieve all new tweets by one or more users, where “new” means all tweets that have been added since the last round of archiving. If the script is called for the first time for a given user, it will try to retrieve all available tweets for that person, although a single user_timeline call returns only the most recent page of results (20 tweets by default), not the full history. It relies on the tweepy package for Python, which is one of a number of libraries providing access to the Twitter API. In case you’re looking for a library for R, check out twitteR.
import sys
import os
import tweepy

# make sure that the directory 'Tweets' exists, this is
# where the tweets will be archived
wdir = 'Tweets'
user = sys.argv[1]
id_file = user + '.last_id'
timeline_file = user + '.timeline'

if os.path.exists(wdir + '/' + id_file):
    # an id file exists: fetch only tweets newer than the last archived id
    f = open(wdir + '/' + id_file, 'r')
    since = int(f.read())
    f.close()
    tweets = tweepy.api.user_timeline(user, since_id=since)
else:
    # first run for this user: there is no id to start from yet
    tweets = tweepy.api.user_timeline(user)

if len(tweets) > 0:
    # results arrive newest first, so the first entry carries the newest id
    last_id = str(tweets[0].id)
    tweets.reverse()
    # write tweets to file, oldest first, one tab-separated line per tweet
    f = open(wdir + '/' + timeline_file, 'a+')
    for tweet in tweets:
        # flatten line breaks so that each tweet stays on a single line
        text = tweet.text.replace('\r', ' ').replace('\n', ' ')
        output = str(tweet.created_at) + '\t' + text.encode('utf-8') + '\t' + tweet.source.encode('utf-8') + '\n'
        f.write(output)
        print output
    f.close()
    # write last id to file
    f = open(wdir + '/' + id_file, 'w')
    f.write(last_id)
    f.close()
else:
    print 'No new tweets for ' + user
The code is pretty straightforward. I wrote it without really knowing Python beyond the bare essentials, relying heavily on IPython's code completion. Actual retrieval of tweets happens in a single line:
tweets = tweepy.api.user_timeline(user)
The rest of the script is devoted to managing the data and making sure only new tweets are retrieved. This is done via the since_id parameter, which is given the id of the newest tweet recorded in the previous round of archiving, as saved in the user’s id file. There are more elegant ways of doing this, but any improvements are up to you.
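The bookkeeping described above can be sketched separately from the network call. This is only an illustration under a few assumptions, not the script itself: it is written for Python 3 (the script above targets Python 2.6), and the names read_last_id, archive_new and Tweet are hypothetical. Tweet is a stand-in that mimics only the id, created_at, text and source attributes the script reads from tweepy's status objects.

```python
import os
from collections import namedtuple

# Hypothetical stand-in for the status objects returned by tweepy; only the
# attributes used by the archiving script are modelled here.
Tweet = namedtuple('Tweet', ['id', 'created_at', 'text', 'source'])


def read_last_id(id_path):
    """Return the id saved by the previous run, or None on the first run."""
    if not os.path.exists(id_path):
        return None
    with open(id_path, 'r') as f:
        return int(f.read())


def archive_new(tweets, timeline_path, id_path):
    """Append tweets (given newest first) to the timeline file, oldest first,
    then record the newest id. Returns the number of tweets written."""
    if not tweets:
        return 0
    newest_id = tweets[0].id
    with open(timeline_path, 'a', encoding='utf-8') as f:
        for tweet in reversed(tweets):
            # Keep one tweet per line by flattening line breaks in the text.
            text = tweet.text.replace('\r', ' ').replace('\n', ' ')
            f.write('\t'.join([str(tweet.created_at), text, tweet.source]) + '\n')
    with open(id_path, 'w') as f:
        f.write(str(newest_id))
    return len(tweets)
```

On each run, the value from read_last_id would be passed as since_id to the timeline call (when it is not None), and the result handed to archive_new.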
2. Fetching a bunch of different users’ tweets with twitter_fetch_all.sh
Second comes a very simple bash script. The only thing it does is call twitter_fetch.py once for each user in a list of people you want to track. Again, there are probably other ways of doing this, but I wanted to keep the functions of the two different scripts separate.
#!/bin/bash
# This will run twitter_fetch.py for each entry in the twitter_users[] array. Add any number of twitter_users[NUMBER]="USER" lines below
# to archive additional accounts.
# --- twitter user list ---
twitter_users[0]="SomeUser"
twitter_users[1]="SomeOtherUser"
twitter_users[2]="YetAnotherUser"
twitter_users[3]="YouGetTheIdea"
# --- execute twitter_fetch.py ---
for twitter_user in "${twitter_users[@]}"
do
    echo "Getting tweets for user $twitter_user"
    python twitter_fetch.py "$twitter_user"
    echo "Done."
    echo ""
done
You should place this in the same directory as twitter_fetch.py and modify it to suit your needs.
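If you would rather keep everything in Python, the same loop could be written roughly as follows. This is a sketch for Python 3, not a replacement I have run against Twitter: fetch_all and build_command are hypothetical names, the account names are the same placeholders as in the bash script, and it assumes that python on your PATH is the interpreter twitter_fetch.py expects.

```python
import subprocess

# Hypothetical account names, mirroring the placeholders in the bash script.
TWITTER_USERS = ['SomeUser', 'SomeOtherUser', 'YetAnotherUser', 'YouGetTheIdea']


def build_command(user, script='twitter_fetch.py'):
    """Return the argument list that runs the fetch script for one user."""
    return ['python', script, user]


def fetch_all(users, run=subprocess.call):
    """Run the fetch script once per user; return a dict of user -> exit code."""
    results = {}
    for user in users:
        print('Getting tweets for user ' + user)
        results[user] = run(build_command(user))
        print('Done.')
        print('')
    return results
```

Calling fetch_all(TWITTER_USERS) from the directory that contains twitter_fetch.py does the same job as the bash script. Passing the command as a list avoids shell quoting issues, and the run parameter exists only so the loop can be tried out without starting real processes.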
3. Automating the whole thing with a cronjob
Finally, here’s a cron directive I used to automate the process and log the result in case any errors occur. Read the Wikipedia article on cron if you’re unfamiliar with it; it’s a very convenient way of automating tasks on Linux/Unix.
0 * * * * bash /root/twitter_fetch_all.sh >/root/twitter_fetch.log 2>&1
(Yes, I’m running this as root. Because I can. And because it’s an EC2 instance with nothing else on it anyway.)
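For readers new to cron syntax: the five fields are minute, hour, day of month, month and day of week, so 0 * * * * fires at minute 0 of every hour. As a rough illustration of that schedule only (cron does not compute it this way internally, and next_hourly_run is a hypothetical helper name), the next firing time after a given moment could be worked out like this:

```python
from datetime import datetime, timedelta


def next_hourly_run(now):
    """Return the next time strictly after `now` at which '0 * * * *' fires."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    return top_of_hour + timedelta(hours=1)


# A job checked at 14:25 next runs at the top of the following hour.
print(next_hourly_run(datetime(2011, 3, 1, 14, 25)))  # prints 2011-03-01 15:00:00
```

With this schedule, the archive is never more than an hour behind the accounts being tracked.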
Hope it’s useful to someone, let me know if you have any questions.