Bot Vote Detection at Wavo

At Wavo, we offer music labels, event promoters and artists a number of services to promote their events or their music. One of the services we offer is hosted DJ/Remix competitions. If an artist is dropping a new track, they can come to us and host a remix competition, where aspiring producers can submit remixes and everyone votes for the best submissions.

The prizes for these competitions are often pretty amazing - flights, DJ gigs, hardware, etc. - so it's natural that some contestants turn to creative-but-ethically-challenged methods of getting out those votes.

Until very recently, we've done vote fraud detection at Wavo manually: we have scripts that dump a contest's votes and related metadata as CSV, and then we go through them, looking for suspicious patterns in the data. It is about as fun as it sounds. So, machine learning to the rescue!

Small aside here: I am no expert. This is my first practical application of ML after having completed the Stanford-Coursera ML course. The goal here was to complete a short (max 3-4 week) applied ML project to complement the math and theoretical course material with some real, applied ML. You can be sure that I've made mistakes and there are better ways to accomplish this.

Tools

After the quickest and most cursory search, I chose to get started with TensorFlow, Python and Keras. If I'm honest with myself, I chose them because: buzz. Remember, friends, I had put a relatively hard limit on how much time I was going to spend on this project. Nevertheless, having completed the project, I believe I chose well.

Getting TensorFlow up was pretty straightforward. Apart from some small amount of Googling to figure out why GPU support wasn't working (NVIDIA CUDA and cuDNN path issues), it was relatively painless to get a solid, GPU-based ML platform up and running. TensorBoard is awesome for visualizing loss, and Keras made setting up the classifier almost embarrassingly easy.

Data

When a vote is recorded on a contest at Wavo, a decent amount of metadata related to that vote is stored with it. Implicit in the vote is a whole host of attributes: who the vote was for, how many votes the user has made on other contests, for whom, etc.

Additionally - and this is one of the main reasons I chose bot vote detection as my first ML project - we had already done an enormous amount of work manually classifying votes as 'bot' or 'not bot'. That is not to say there were no problems. In fact, the bulk of my time was spent working on the data, not on the code.

Problem: Data Quality

One problem that I encountered was that the data we had was actually relatively low quality when it came to classifying votes as bots. As I mentioned earlier, we had done a lot of work on this, but that work was pretty manual and prone to misclassifications. In particular, we had a tendency to err on the side of caution - if there was doubt, we typically chose not to mark a vote as a bot. This meant that our training data was not ideal.

To resolve this, I went through several clean-up iterations, where I trained the model against a subset of the data, and then ran predictions on another portion. Sorting the predictions by probability, I then manually evaluated the results, and if the classifier detected a bot that I had missed, I marked that vote as 'bot' in the training data. It was tedious, but this process significantly raised the quality of the data.
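
The review step in each clean-up iteration can be sketched roughly as follows. This is a minimal illustration rather than the actual Wavo script: the `review_queue` helper, the `label` key and the 0.5 threshold are all assumptions made for the example.

```python
def review_queue(votes, probabilities, threshold=0.5):
    """Return (probability, vote) pairs that are labelled 'not bot' but that
    the classifier scores as likely bots, most confident first."""
    suspects = [
        (p, vote)
        for vote, p in zip(votes, probabilities)
        if vote["label"] == 0 and p >= threshold
    ]
    # Sort on the probability only, highest first, so the most
    # suspicious votes are reviewed by hand before the marginal ones.
    return sorted(suspects, key=lambda pair: pair[0], reverse=True)


# Hypothetical held-out votes and the model's predicted bot probabilities.
held_out = [
    {"id": 1, "label": 0},
    {"id": 2, "label": 1},
    {"id": 3, "label": 0},
    {"id": 4, "label": 0},
]
predicted = [0.97, 0.99, 0.10, 0.62]

for probability, vote in review_queue(held_out, predicted):
    print(vote["id"], probability)
```

Votes confirmed as bots during the manual review would then be relabelled before the next training round.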

Problem: Category Encoding

One of the fields that we store in our metadata is a country code: the country from which the vote has been made. Now, you can't just hand a string like 'US' to a neural network and expect it to understand what is going on; you have to encode it. Initially, because it was simplest, I used scikit-learn's LabelEncoder to encode countries to unique numeric values - e.g., CA=4, US=26, etc. However, after many, many tests, I came to the realization that encoding countries as ordinal values was not ideal. Ordinal values imply ranking: 4 < 26. Suppose Canada is a country that is associated with more bot activity than usual. The neural network will learn, then, that the value 4 is associated with bots, but it will also learn that low numbers are associated with bots. Whichever poor countries got assigned the index 3 or 5 are going to be more likely to get marked as bots.

Now, the solution to this is called One-Hot Encoding (or dummy encoding), so I swapped out my LabelEncoder for Pandas' get_dummies() function. On reading in the raw data, I took a single column for country, and swapped that with a few hundred boolean columns generated by get_dummies(), each of which represented a single country. If a vote came from Canada, there would be 200-ish columns with values zero, and exactly one column (the Canada column) with a 1. This way, the neural network treats individual countries as boolean features in isolation, and no misleading rankings between countries are learned.
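
To make the difference concrete, here is a small sketch comparing the two encodings on a toy data set. The column names (`country`, `votes_cast`) and the values are made up for the example, and `pd.factorize` stands in for LabelEncoder only to keep the example pandas-only; both assign an arbitrary integer to each category.

```python
import pandas as pd

# Toy votes; the columns and values are hypothetical.
votes = pd.DataFrame({
    "country": ["CA", "US", "CA", "DE"],
    "votes_cast": [3, 1, 7, 2],
})

# Ordinal encoding: every country becomes an integer, which implies
# an ordering (CA < DE < US) that means nothing for this problem.
codes, uniques = pd.factorize(votes["country"], sort=True)

# One-hot encoding: the single country column is replaced by one
# 0/1 column per country, so no ordering is implied.
one_hot = pd.get_dummies(votes, columns=["country"], dtype=int)

print(dict(zip(uniques, range(len(uniques)))))
print(one_hot)
```

Each row of the one-hot frame has exactly one country column set to 1.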

Problem: Model Persistence

One of the things that seems obvious in retrospect but is not included in any of the tutorials is: if you intend to use a model beyond a single train-predict run, then you need to persist the model. And, very important, that includes persisting any encodings.

For example, suppose the training set includes 65 distinct countries. Naively, you can assign each of those countries one column in the data set, and then set 1 or 0 accordingly (this is what get_dummies() will do by default). Having trained and saved the model, now suppose you want to apply it to a new data set in which only 35 of the 65 countries are represented. Calling get_dummies() on that data set will give you 35 columns. If you toss those into your neural network, it won't work: you'll have an input-shape mismatch (you're missing 30 columns!). You have a similar problem if your new data set includes all 65 countries and a 66th, unseen country. In that case, get_dummies() will generate 66 columns; a mismatch again.

So, categorical features need to be one-hot encoded, and the ordering of the resulting columns needs to be persisted in some form. To accomplish this, for each of our categorical features, I defined a simple Python list of all possible values, plus "empty" (the category was left empty) and "unknown" (the category is new / not in the list of possible categories). When get_dummies() is called, the result is re-indexed using this all-possible-categories list, which means that even if my training set only includes 4 distinct countries, all 200-ish columns will be output for each vote. Likewise, when it comes time to make predictions, we can hand the prediction script a single vote and it will correctly encode that vote's country column, along with the 200-ish other zero-valued country columns.
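
A minimal sketch of that re-indexing idea is below. The `encode_country` helper and the five-entry `COUNTRY_CATEGORIES` list are hypothetical stand-ins (the real list would hold every known country code), so treat this as one way to do it rather than the exact code in use.

```python
import pandas as pd

# Hypothetical, shortened list; "empty" and "unknown" are the two
# catch-all categories described above.
COUNTRY_CATEGORIES = ["empty", "unknown", "CA", "DE", "US"]


def encode_country(countries, categories=COUNTRY_CATEGORIES):
    """One-hot encode a Series of country codes into a fixed set of columns."""
    cleaned = countries.fillna("empty")
    cleaned = cleaned.mask(cleaned == "", "empty")
    # Anything not in the persisted list is folded into "unknown".
    cleaned = cleaned.where(cleaned.isin(categories), "unknown")
    dummies = pd.get_dummies(cleaned, dtype=int)
    # Re-index so the same columns come out in the same order,
    # no matter which countries appear in this batch.
    return dummies.reindex(columns=categories, fill_value=0)


training = encode_country(pd.Series(["CA", "US", "DE", "CA"]))
single_vote = encode_country(pd.Series(["US"]))
unseen = encode_country(pd.Series(["ZZ", None, ""]))
```

Because the column list is fixed, a four-vote training batch and a single vote at prediction time both produce the same input shape.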

Network Structure

Now, I knew from the start that I wanted to work with a neural network (even if some simpler representation would have been easier or more accurate). I specifically wanted to learn more about NNs, and this seemed a suitable problem. However, big question: structure? I had no idea. Samples online suggested that a densely connected NN with one to two hidden layers and a sigmoid output would do the job, so that's the direction I went.

There remained a number of questions to answer (which were evaluated by running the training and then testing the F1 score of the model on the validation set):

How many hidden layers?
- Testing one, two and three hidden layers, I seemed to get the best results on the validation set with two layers. A single hidden layer trained quickly, but levelled off at a lower accuracy/F1.
Dropout?
- Dropout basically hides neurons at random while training, so as to encourage the network to learn more robustly and avoid overfitting the training set. I tested the following configurations of dropout (input, layer1, layer2): (0, 0, 0), (0, 0.2, 0.2), (0, 0.5, 0.5), (0.2, 0.2, 0.2), (0.5, 0.5, 0.5)
- I was specifically interested in input-layer dropout (feature dropout). On that axis, tests suggested that 0.2 was optimal. For the hidden layers, 0.5 dropout ended up being best, particularly with longer training.
How many neurons per layer?
- I tested my network at 10, 30, 50, 100, 200, 400 neurons per layer. Running the tests for a short 10 epochs (10 training runs through the data), my best results were between 50 and 200 neurons. Running again at 40 epochs confirmed 200 neurons as the ideal size.
Data balance
- The data we have is not balanced. There are far more legitimate votes than bot votes. Should the network be trained against a balanced (by undersampling the 'not-bot' votes) data set, or should it be trained against all of my data?
- I tested balanced, 2x as many non-bots, 4x non-bots, and completely unbalanced. For our dataset, the benefits of more data (more votes) appear to outweigh the danger that the NN learns to just say "not-bot" all the time: the best balance for us was no balancing. To be fair, I populated the training set with data from contests that already had significant known cheating, so we had a relatively high number of bot votes in the set.
- Side note: it occurs to me that I should test oversampling the bot votes as an alternative to undersampling the not-bot votes.
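
The balancing experiment can be sketched as a simple undersampling helper. The `undersample` function, the `label` key and the fixed seed are assumptions for illustration; the ratios match the ones tested above (1x, 2x, 4x non-bots).

```python
import random


def undersample(votes, ratio, seed=0):
    """Keep every bot vote and at most `ratio` times as many non-bot votes."""
    bots = [v for v in votes if v["label"] == 1]
    humans = [v for v in votes if v["label"] == 0]
    keep = min(len(humans), int(ratio * len(bots)))
    rng = random.Random(seed)  # fixed seed so runs are comparable
    combined = bots + rng.sample(humans, keep)
    rng.shuffle(combined)
    return combined


# Hypothetical data set: 10 bot votes and 100 legitimate votes.
all_votes = [{"label": 1}] * 10 + [{"label": 0}] * 100

balanced = undersample(all_votes, ratio=1)
two_to_one = undersample(all_votes, ratio=2)
four_to_one = undersample(all_votes, ratio=4)
```

The completely unbalanced case is just the original list, with nothing discarded.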

Final network structure

Once I had run through those tests, I landed on the following configuration:
A neural network with two hidden layers of 200 neurons each, 0.2 dropout on the input layer and 0.5 dropout on each hidden layer. The training data consisted of 37838 votes, of which 6123 had been marked as 'bots'.
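
In Keras, a network of that shape might look roughly like the sketch below. Only the layer sizes, the dropout rates and the sigmoid output come from the description above; the ReLU activations, the Adam optimizer, the binary cross-entropy loss and the feature count of 250 are assumptions made for the example.

```python
from tensorflow import keras


def build_model(n_features, n_hidden=200):
    """Two hidden layers with dropout on the input and on each hidden layer."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dropout(0.2),                        # input (feature) dropout
        keras.layers.Dense(n_hidden, activation="relu"),  # activation is an assumption
        keras.layers.Dropout(0.5),
        keras.layers.Dense(n_hidden, activation="relu"),
        keras.layers.Dropout(0.5),
        keras.layers.Dense(1, activation="sigmoid"),      # bot probability
    ])
    # Optimizer and loss are assumptions, chosen as common defaults
    # for binary classification.
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


model = build_model(n_features=250)  # 250 is a placeholder feature count
```

Dropout is only active during training; at prediction time Keras disables it automatically.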

Results

Given this setup, I let the network train for 200 epochs, and ended up with the following results:

TRAIN PERFORMANCE
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00     31715
        1.0       0.99      0.99      0.99      6123

avg / total       1.00      1.00      1.00     37838

37664/37838 [============================>.] - ETA: 0s
acc: 99.74%

VALIDATION PERFORMANCE
             precision    recall  f1-score   support

        0.0       0.99      0.99      0.99     15805
        1.0       0.94      0.95      0.95      2235

avg / total       0.99      0.99      0.99     18040

17920/18040 [============================>.] - ETA: 0s
acc: 98.67%

The two numbers I had been struggling hardest with were the precision and recall for bot votes on the validation set. These results indicate that roughly 6% of detected bot votes will be false positives (precision 0.94), and roughly 5% of real bot votes, about one in twenty, will be falsely marked as 'not-bot' (recall 0.95). Those are pretty good results for a first pass. Once this is deployed, I intend to have it automatically categorize new votes for me to periodically review. I will correct any false positives and false negatives, which should help me to grow accurate training and validation sets significantly. Retraining on the new data should help improve performance even more.
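
For anyone who wants to see how precision and recall translate into vote counts, here is a small worked sketch. The `precision_recall_f1` helper and its example counts are illustrative, and the back-of-envelope figures at the end are derived from the rounded values in the report above, so they are approximate.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of flagged votes that really are bots
    recall = tp / (tp + fn)     # share of real bots that were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 bots caught, 10 false alarms, 30 bots missed.
example = precision_recall_f1(tp=90, fp=10, fn=30)

# Rough estimate from the validation report: 2235 bot votes,
# recall 0.95 and precision 0.94 (both rounded).
bots_in_validation = 2235
missed = bots_in_validation * (1 - 0.95)
caught = bots_in_validation - missed
false_alarms = caught * (1 / 0.94 - 1)
```

With those rounded inputs, both the missed bots and the false alarms come out at somewhat over a hundred votes each.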

One more reminder: I am not a machine learning expert, and this is not machine learning advice. I welcome any suggestions.