Amy Boyd, cloud developer advocate at Microsoft, studied computer science at university and completed her research project on natural language processing and machine learning in 2012. Back then Boyd was a big fan of The X Factor TV music talent show and was also witnessing the rapid ascent of Twitter. She decided to do her research project on predicting the outcome of The X Factor reality competition using Twitter data.
Her view was that The X Factor is both a singing contest and a popularity contest. Therefore, with Twitter being a platform where people share their opinions, data from the social network could be used to gauge popularity and determine who would win.
Boyd had to create a sentiment analysis classifier, trained on domain-specific data, to determine whether a tweet was positive or negative. This was somewhat complicated because people talking about The X Factor on Twitter would use slang terms, such as ‘sick’ to mean ‘good’, so she had to be careful about what counted as positive or negative. She then had to rank the contestants.
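To make the idea of domain-specific training concrete, here is a minimal sketch of a Naive Bayes sentiment classifier written from scratch, followed by a simple ranking step. It is not Boyd's implementation: the training tweets, the function names and the choice to rank contestants by their share of positive tweets are all assumptions made for illustration. Because the toy training data uses ‘sick’ as praise, the classifier learns to treat it as positive.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-labelled tweets; note that "sick" appears as praise.
TRAINING = [
    ("that performance was sick", "pos"),
    ("sick vocals tonight", "pos"),
    ("loved her song", "pos"),
    ("that was rubbish", "neg"),
    ("awful song choice", "neg"),
    ("rubbish vocals tonight", "neg"),
]

def train(examples):
    """Count words per label for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(label_counts[label] / total)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

def rank_contestants(tweets_by_contestant, model):
    """Order contestants by share of positive tweets (an assumed ranking rule)."""
    shares = {}
    for name, tweets in tweets_by_contestant.items():
        positives = sum(classify(t, model) == "pos" for t in tweets)
        shares[name] = positives / len(tweets)
    return sorted(shares, key=shares.get, reverse=True)

model = train(TRAINING)
print(classify("his performance was sick", model))
```

A classifier trained on general-purpose text would probably read ‘sick’ as negative, which is why the labelled examples need to come from the same conversation the classifier will be applied to.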
Ultimately Boyd was able to predict who would win each week, as measured by the number of votes from the public, once there were six contestants or fewer. During the course of her investigation she realised that there were biases in the data. One is that Twitter has more positive data than negative, at a ratio of about 70:30.
This meant that her negative classifier was never as good as her positive classifier, as there was less negative data and therefore less to learn from. She also found that SVMs, or support vector machines, performed better than Naive Bayes classification algorithms.
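The 70:30 skew also affects how a classifier should be judged. The sketch below is an illustration under stated assumptions rather than Boyd's evaluation: the 100-tweet sample and the `recall_per_class` helper are hypothetical. It shows that a baseline which always answers ‘positive’ looks respectable on overall accuracy while never finding a single negative tweet, which is why per-class scores are worth checking on imbalanced data.

```python
def recall_per_class(true_labels, predicted_labels):
    """Fraction of each class's examples that the predictions got right."""
    recall = {}
    for label in set(true_labels):
        indices = [i for i, t in enumerate(true_labels) if t == label]
        hits = sum(predicted_labels[i] == label for i in indices)
        recall[label] = hits / len(indices)
    return recall

# Hypothetical sample of 100 tweets at roughly the 70:30 ratio.
true_labels = ["pos"] * 70 + ["neg"] * 30
# A baseline that ignores the text and always predicts the majority class.
predicted = ["pos"] * 100

accuracy = sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)
recall = recall_per_class(true_labels, predicted)
print(accuracy, recall["pos"], recall["neg"])
```

Any real comparison between SVMs and Naive Bayes on this kind of data would want the same per-class breakdown, so that a strong positive score does not hide a weak negative one.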
This is an example of how important it is to have a passion for what you are working on or investigating. “The topic that I chose meant that I investigated the data more because I had lived that scenario. I would share my opinion the next day in class on who had done well and who was rubbish that week. So I dove into the data because I was interested. That wouldn’t have necessarily been the case if it had been a basic research dataset,” said Boyd.
Often the issues around data science can be really serious, so it was refreshing to hear how machine learning and natural language processing could be used to answer a lighter question.
Amy Boyd was speaking at The Ethics of Artificial Intelligence at the Microsoft Reactor London.