Continuing my forensic linguistics Twitter project.
Rather than going with my first ten randomly selected participants in #phdchat, I took twenty randomly selected participants to see what I would get. Out of those twenty I eliminated eight, either because they were institutions rather than individual tweeters (like @GdnHigherEd, which I wrote about last time), or because for some reason my data collection script gave me fewer than 100 tweets for them. I took the other twelve tweeters, with between 120 and 190 tweets each, and saved them as a new data set.
This is the data set I’m going to use for my computational experiments, at least to begin with. So I’ve spent the early afternoon thinking about how to segment the data into training and testing samples, and reading up on how to do Principal Components Analysis in R.
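To make that concrete, here is roughly the kind of R I have in mind, a sketch only rather than the final analysis: it assumes a data frame called tweet_features (a made-up name) with one row per tweet, a tweeter label column, and numeric stylometric feature columns, since I haven't settled on the actual features yet. It holds out a portion of each tweeter's tweets for testing, runs PCA on the training portion with prcomp, and projects the held-out tweets into the same component space.

```r
# Sketch only: assumes a hypothetical data frame `tweet_features` with a
# `tweeter` column and numeric feature columns. Not the final analysis.

set.seed(42)

# Hold out roughly 20% of each tweeter's tweets as a test sample
test_idx <- unlist(lapply(
  split(seq_len(nrow(tweet_features)), tweet_features$tweeter),
  function(idx) sample(idx, size = ceiling(0.2 * length(idx)))
))
train <- tweet_features[-test_idx, ]
test  <- tweet_features[test_idx, ]

# PCA on the numeric feature columns of the training set, centred and scaled
num_cols <- sapply(train, is.numeric)
pca <- prcomp(train[, num_cols], center = TRUE, scale. = TRUE)

summary(pca)         # proportion of variance explained by each component
head(pca$x[, 1:2])   # training tweets' scores on the first two components

# Project the held-out test tweets into the same PCA space
test_scores <- predict(pca, newdata = test[, num_cols])
```

The per-tweeter split is just one way of doing it; the point is that every tweeter appears in both the training and testing samples, so the components are fitted without ever seeing the tweets they'll later be tested on.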