Today I finally finished debugging my automated tagger and ran some classification experiments. I’m using some machine learning algorithms from the Weka software package:

*The C4.5 decision tree* goes through the data and figures out a set of rules, or discrete decision points, that best characterize the data. So, for example, imagine I have three subjects in my corpus: Raúl, Elisa, and Marco. The decision tree might say that if a tweet has more than two emoticons in it, then Raúl wrote it; if not, you can tell Elisa from Marco because she uses vowels with accents and he doesn't.

*Bernoulli and multinomial naive Bayes* algorithms work by calculating conditional probabilities, like the probability that a tweet is by Raúl given that it has an emoticon, or the probability that a tweet is by Marco given that it has the letter ó.

*Support vector machines* I don't understand well enough to explain, but I'll read up in time to explain them in my paper. I'm using them because I'm replicating a study that used them. If you're curious, there's always Wikipedia.
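To make the naive Bayes idea concrete, here's a toy calculation of "the probability that a tweet is by Raúl given that it has an emoticon" using Bayes' rule. All the counts here are invented for illustration; the real classifier does this over many features at once.

```python
# Toy corpus: (author, has_emoticon) pairs. These counts are made up.
tweets = [
    ("Raul", True), ("Raul", True), ("Raul", False),
    ("Elisa", True), ("Elisa", False), ("Elisa", False),
    ("Marco", False), ("Marco", False), ("Marco", False),
]

def p_author_given_emoticon(author, data):
    """Bayes' rule: P(author | emoticon) = P(emoticon | author) * P(author) / P(emoticon)."""
    n = len(data)
    n_author = sum(1 for a, _ in data if a == author)
    n_emoticon = sum(1 for _, e in data if e)
    n_both = sum(1 for a, e in data if a == author and e)
    p_e_given_a = n_both / n_author   # P(emoticon | author)
    p_a = n_author / n                # prior P(author)
    p_e = n_emoticon / n              # P(emoticon)
    return p_e_given_a * p_a / p_e

# On these toy counts, 2 of the 3 emoticon tweets are Raúl's, so P = 2/3.
print(p_author_given_emoticon("Raul", tweets))
```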

For each of these algorithms, I’m using ten-fold cross validation. This means that you split the data into ten stratified random samples (i.e. each sample has an even representation from all the writers in the corpus), train the classifier on nine of them, and then have it give its best guess on the tenth sample. You repeat this ten times over, leaving out each of the samples once, so that in the end all of the texts in the corpus have been part of the training data nine times, and part of the testing data once. Then, for each text, you count up whether the classifier correctly identified the author when that text was in the testing data.
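The procedure above can be sketched in a few lines of scikit-learn (standing in for Weka, which is what I'm actually using). The features and labels below are random placeholders, not my real tweet data; `StratifiedKFold` does the stratified splitting, and each text ends up in the test fold exactly once.

```python
# Sketch of stratified ten-fold cross-validation; data are random placeholders.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(300, 20))   # 300 "tweets", 20 count-valued features
y = rng.integers(0, 3, size=300)         # 3 "authors"

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in skf.split(X, y):
    # Train on nine folds, test on the held-out tenth.
    clf = MultinomialNB().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

print(f"mean accuracy over 10 folds: {np.mean(scores):.3f}")
```

Since the labels here are random, the accuracy hovers around chance (one in three); the point is only the splitting and scoring loop.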

You might remember that I was tagging each tweet on over 9,000 features. A lot of those features are going to be useless noise to the classifier, so before I run the different algorithms, I have to calculate which features are most likely to be helpful. I started by leaving out everything that occurs fewer than 18 times, because my least prolific author wrote 180 tweets, so each author will have at least 18 tweets in each subsample. This narrowed it down from over 9,000 features to 197. Then I made even smaller sets of features using information gain and correlation-based feature subsets.
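The two-stage pruning above can be sketched like this: a frequency cutoff first, then an information-gain ranking of the survivors. I'm using scikit-learn's `mutual_info_classif` as a stand-in for Weka's information-gain ranker, and the feature matrix is a random placeholder, not my real tweet features.

```python
# Sketch of the feature pruning: frequency cutoff, then information-gain ranking.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(1)
X = rng.poisson(0.03, size=(500, 1000))  # sparse placeholder count features
y = rng.integers(0, 3, size=500)         # placeholder author labels

# Stage 1: drop features occurring fewer than 18 times in the whole corpus.
keep = X.sum(axis=0) >= 18
X_kept = X[:, keep]
print(f"{keep.sum()} of {X.shape[1]} features occur 18+ times")

# Stage 2: rank the surviving features by mutual information with the
# author label (a stand-in for Weka's information-gain attribute ranker).
ig = mutual_info_classif(X_kept, y, discrete_features=True, random_state=0)
ranked = np.argsort(ig)[::-1]  # indices of kept features, best first
```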

Rico-Sulayes (2012) gets better results using some of the fancier techniques like correlation-based feature subsets and support vector machines. My results look a bit different:

| | Features occurring 18+ times | Information gain | CFS |
|---|---|---|---|
| Decision tree | 66.8% | 68.5% | 60.1% |
| Bernoulli naive Bayes | 59.1% | 60.2% | 48.7% |
| Multinomial naive Bayes | 65.3% | 66.2% | 55.0% |
| SVM | 64.1% | 65.5% | 52.4% |

In my data, the correlation-based feature subset procedure eliminates so many features that it makes the classifiers perform worse across the board. Also, my best-performing classifier turns out to be the decision tree, which Rico-Sulayes included more as a baseline to see how much better the other classifiers would do. And even my best result is only 68.5% of texts classified correctly, while the state of the art is up in the 90s.

Writing this paper is going to be interesting.
