Thing-a-day #8: Python script for Twitter data collection

Edit: this post originally misrepresented Naomi Barnes’ take on the CMC literature; corrected version below.

One of my term projects this semester is a seminar paper in forensic linguistics that involves statistical authorship attribution: given a piece of text, can you train a computer to figure out who wrote it? I’m trying to replicate Rico-Sulayes (2011), but where he used forum posts, I’m planning to use tweets.

Why tweets? Well, there are real forensic contexts where someone receives a text message and the identity of the sender is in question. Say a teenager goes missing and then their parents get a text that says “Don’t worry, everything’s fine,” and the question is whether it’s really from their kid. Investigators use various methods to try to figure this out, but the situation is difficult to replicate experimentally because people don’t like to share their text messages. So how do you conduct research to improve investigative methods?

I’m hoping to use tweets as a publicly available substitute for texts. They’re similarly short, and short length is one of the big problems in quantitative authorship attribution: computational linguists like big data sets, but criminal investigators often get only very small texts to work with. There are differences between tweets and texts, of course, but from the methodological perspective of comparing a questioned message to a reference corpus of messages of known authorship, they may not matter so much.
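
To make "comparing a questioned message to a reference corpus" concrete, here is a minimal sketch of one common approach: build a character-trigram profile for each candidate author and pick the one most similar to the questioned message. This is not the method from Rico-Sulayes (2011), just an illustration; the function names and the toy messages are invented for this example.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def attribute(questioned, reference):
    """Return (closest author, all scores) for a questioned message.

    `reference` maps an author name to a list of messages of known authorship.
    """
    target = char_ngrams(questioned)
    scores = {}
    for author, messages in reference.items():
        # Pool all of an author's known messages into one profile.
        profile = Counter()
        for message in messages:
            profile.update(char_ngrams(message))
        scores[author] = cosine(target, profile)
    return max(scores, key=scores.get), scores


# Invented toy data, purely for illustration.
reference = {
    "author_a": ["omg cant wait 2 c u l8r!!", "lol ur so funny, c u 2moro!!"],
    "author_b": ["I shall see you tomorrow, then.",
                 "That is very amusing; I cannot wait."],
}
best, scores = attribute("cant wait 2 c u 2moro lol!!", reference)
print(best, scores)
```

With messages this short the scores are noisy, which is exactly the problem described above; a real study would need many more reference messages per author and a proper evaluation.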

This is my second time collecting social media data for a project. (A lot of researchers start off expecting, as sociologist Naomi Barnes writes of her own experience, that “Not once [would] you find a study that actually believed Facebook status updates are worthy pieces of data.” But as she found, the literature is deep; linguistic-ethnographic studies of computer-mediated communication go back over a decade.) The first time, I was studying blog comments, and I cut and pasted them by hand into a text file. This time, I thought I could be a little more sophisticated. So I put together this Python script (thanks to Liz Merkhofer for putting me on track):

import random
import twitter

api = twitter.Api()

# Collect tweets from contributors to the #phdchat hashtag
# ('%23' is the URL-encoded '#').
phdchat = api.GetSearch(term='%23phdchat', per_page=100)

# Randomly choose up to ten of the people who posted the last 100 tweets on #phdchat.
# The min() guard avoids a ValueError if fewer than ten distinct people posted.
allnames = sorted({tweet.user.screen_name for tweet in phdchat})
names = random.sample(allnames, min(10, len(allnames)))

# Pull as many tweets as I can get from each person in my list of names.
corpus = {}
for name in names:
    corpus[name] = api.GetUserTimeline(id=name, count=200)

It works, as far as it goes, but it still needs some thinking. I’m not sure #phdchat is the right search term for ensuring rough comparability of tweets across authors. (I don’t want to compare, say, Kim Kardashian to Horse ebooks, because that might be too easy a computational problem and not representative of the forensic situation.) For example, my script did pull in mainly real grad students, but it also included The Guardian Higher Education Network. The Warwick Institute of Advanced Study also got pulled in for having retweeted a @GdnHigherEd post on #phdchat. So I’ve got some more work to do.
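
One possible way to handle the two cases above is to screen the hashtag results before sampling names: skip retweets, and skip accounts on a hand-made exclusion list. The sketch below works on plain (screen_name, text) pairs rather than on the library's objects, so that it runs on its own; the function names, the "RT @" heuristic, and the sample tweets are all assumptions made for illustration, not part of my script.

```python
def is_retweet(text):
    """Heuristic: treat old-style 'RT @user ...' messages as retweets."""
    return text.strip().upper().startswith("RT @")


def candidate_authors(tweets, exclude=()):
    """Return sorted screen names that posted at least one non-retweet.

    `tweets` is an iterable of (screen_name, text) pairs.
    `exclude` is a hand-made list of accounts to drop (case-insensitive).
    """
    excluded = {name.lower() for name in exclude}
    names = set()
    for screen_name, text in tweets:
        if is_retweet(text):
            continue  # a retweet is not the account holder's own writing
        if screen_name.lower() in excluded:
            continue  # institutional or otherwise unwanted account
        names.add(screen_name)
    return sorted(names)


# Invented sample data, purely for illustration.
sample = [
    ("grad_student_1", "Drafting chapter 3 today #phdchat"),
    ("GdnHigherEd", "How to survive your viva #phdchat"),
    ("institute_account", "RT @GdnHigherEd: How to survive your viva #phdchat"),
    ("grad_student_2", "Anyone else behind on marking? #phdchat"),
]
authors = candidate_authors(sample, exclude=["GdnHigherEd"])
print(authors)
```

The exclusion list still has to be built by hand, so this only reduces the problem; it doesn't settle whether #phdchat yields comparable authors in the first place.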

Cross-posted at SocPhD
