Having collected my Twitter corpus, I had to tag it. That is, eventually I’m going to take a selected subcorpus and have the computer guess which of my 10 authors wrote each tweet in it, but that guess will be based on statistical models, and statistical models take numbers as input, not raw linguistic data. So I had to count something in each tweet, and those total counts will be my starting point for the statistics.
Since I’m replicating Rico-Sulayes 2012, I used basically the same list of features that he did. So in each tweet, here’s what I counted:
- Structural features such as @-replies, @-mentions, retweets, hyperlinks, and emoticons, which I searched for with a regular expression.
- Syntactic features, which could conceivably include a lot of things, but Rico-Sulayes only considers multi-word function words such as debajo de ‘below’ and de otra manera ‘otherwise.’
- Lexical features including any word, punctuation mark, or hashtag that turns up anywhere in the corpus. So if anyone uses the word allá ‘there’, then I count it every time it appears. If someone else spells it alla without the accent, that’s a different feature to be counted. The word “the” is a feature, but so is a word that only shows up one time. Altogether there are nearly 9,000 unigram types in the corpus, and I counted each one of them separately.
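The exact emoticon regex from the original post isn’t shown here, but patterns like it are easy to sketch. Here is a minimal example of my own (an assumption, not the pattern actually used), built from the usual eyes–nose–mouth structure:

```python
import re

# A guess at an emoticon-matching pattern (NOT the original post's regex):
# optional brow, eyes, optional nose, then a mouth character.
EMOTICON = re.compile(r"""
    [<>]?                   # optional brow/hat
    [:;=8xX]                # eyes
    [-o*']?                 # optional nose
    [)(\]\[dDpP/\\|}{@3]    # mouth
    """, re.VERBOSE)

print(EMOTICON.findall("jajaja :-) que buen dia ;D xD"))
# → [':-)', ';D', 'xD']
```

A pattern like this inevitably makes judgment calls, e.g. whether `xD` counts while `jajaja` does not, so in practice it would be tuned against the corpus by hand.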
Over 9,000 features altogether in 1,930 tweets is far too many to count manually, so I wrote a script to do it for me. The script took over an hour to run, probably in part because it wasn’t terribly well written. Then I realized that I had made a small mistake in the code and the results hadn’t saved, so I had to run it all over again. (At that point I went to sleep and checked the results the following morning.)
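The counting step can be sketched roughly as follows. The tokenizer and the choice of structural features below are my own simplifications for illustration, not the original script:

```python
import re
from collections import Counter

def tokenize(text):
    # Crude tokenizer: optionally @- or #-prefixed word characters
    # (accented letters included), plus standalone punctuation marks.
    return re.findall(r"[@#]?\w+|[^\w\s]", text)

def count_features(tweet):
    # Lexical features: every unigram type, counted case-insensitively.
    counts = Counter(tokenize(tweet.lower()))
    # A couple of structural features, counted alongside the unigrams.
    counts["RETWEET"] = 1 if tweet.startswith("RT ") else 0
    counts["HYPERLINK"] = len(re.findall(r"https?://\S+", tweet))
    return counts

c = count_features("RT @amigo: allá vamos! http://t.co/abc #viaje")
print(c["allá"], c["RETWEET"], c["HYPERLINK"])
# → 1 1 1
```

Even a naive loop like this over ~2,000 tweets should run in seconds; the hour-long runtime suggests the original script recomputed something per tweet that could have been cached.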
So for each tweet, I’ve created an ordered pair: the first entry is the username of the person who tweeted it, and the second is a list of counts for every feature I tagged. Those counts will be the input to the statistical modeling. The next step is feature reduction: not all 9,000 of those features are going to be useful for author identification, so I have to find the subset that works best. I’ll explain more about how that works when I do it.
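For the count lists to be comparable across tweets, every tweet’s vector has to follow one fixed feature ordering shared by the whole corpus. A minimal sketch of that data structure, with toy tokenized data standing in for the real 1,930 tweets:

```python
from collections import Counter

# Toy stand-in for the corpus: (username, tokenized tweet) pairs.
corpus = [
    ("ana",  ["allá", "vamos", "!"]),
    ("luis", ["alla", "voy", "!", "!"]),
]

# Global feature list: every type seen anywhere in the corpus,
# in a fixed order so count vectors line up column-for-column.
# Note that allá and alla really are two separate features.
features = sorted({tok for _, toks in corpus for tok in toks})

def vectorize(tokens):
    counts = Counter(tokens)
    return [counts[f] for f in features]

pairs = [(user, vectorize(toks)) for user, toks in corpus]
print(features)   # → ['!', 'alla', 'allá', 'vamos', 'voy']
print(pairs[0])   # → ('ana', [1, 0, 1, 1, 0])
```

With the real corpus, each second entry is a vector of ~9,000 mostly-zero counts, which is exactly why the next step has to be feature reduction.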