Improving a Text Classifier and Generator: Update #6

As a refresher for those of you who may have missed part of this adventure: I’m attempting to find the optimal diversity value for generating text by simulating a type of neural network called a Generative Adversarial Network (GAN), in which two models, a generative algorithm and a discriminative algorithm (one that classifies data), essentially work to improve each other. My simplified version uses a headline classifier (which labels headlines as real or fake) to measure the false-positive rate across a set of 100 headlines (I’ve decreased the sample size for performance reasons) generated with randomized diversity values. The average diversity with the highest false-positive rate, i.e. the diversity that most often tricks the classifier into thinking a fake headline is real, is the best value to use. The first time I attempted this experiment, the classifier told me that 750 of my 1,000 generated headlines were real, meaning it only had an accuracy of around 25%, which is terrible. So I decided that the only solution was to build a better classifier with more data, which in turn requires a better generator to produce 50,000 new headlines of various lengths.
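To make the setup concrete, here’s a minimal sketch of that simulation loop. The generate_headline and classify_real_prob functions are hypothetical stand-ins for my generator and classifier, and the names, signatures, and 1.2–2.9 diversity range here are assumptions rather than the exact code:

```python
import random

def run_trial(generate_headline, classify_real_prob, n_headlines=100):
    """Generate fake headlines at random diversities and record which ones
    fool the classifier (false positives)."""
    results = []  # (diversity, fooled_the_classifier) pairs
    for _ in range(n_headlines):
        diversity = random.uniform(1.2, 2.9)         # randomized diversity
        headline = generate_headline(diversity)      # fake headline
        fooled = classify_real_prob(headline) > 0.5  # classifier says "real"
        results.append((diversity, fooled))
    false_positive_rate = sum(f for _, f in results) / n_headlines
    return results, false_positive_rate
```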

Using the All the News dataset from Kaggle, which contains three files of 50,000 news articles each, I decided that to train my new classifier, I would use 50,000 of those real headlines and generate 50,000 fake ones using an improved generator trained on one of the three files. Whew! That means the classifier works with 100,000 headlines of varying lengths, 90,000 of them for training (a 90%/10% train/test split), which is nearly 6x the number of headlines as before!
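Assembling that dataset looks roughly like the sketch below. The file and column names ("articles1.csv", "title", and my "generated_headlines.csv") are assumptions about how the Kaggle CSVs and my generated output are laid out:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 50,000 real headlines from one of the All the News files, plus 50,000 fakes
real = pd.read_csv("articles1.csv")["title"].dropna().head(50000)
fake = pd.read_csv("generated_headlines.csv")["headline"].head(50000)

headlines = pd.concat([real, fake], ignore_index=True)
labels = [1] * len(real) + [0] * len(fake)  # 1 = real, 0 = fake

# 90% training / 10% testing, shuffled
X_train, X_test, y_train, y_test = train_test_split(
    headlines, labels, test_size=0.1, shuffle=True, random_state=42)
```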

One of the issues with my previous classifier was that all of the fake headlines I used to train it were exactly five words long, meaning the classifier might have been using length as a feature and classifying every five-word headline as fake. I solved this problem by adding an ending token to each training headline, so that the generator naturally cuts a headline off once it produces that token. This led to headlines of various lengths, from as short as 3 words to as long as 15.
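The end-token trick looks roughly like this; the token string, the helper names, and the injected next_word sampler are all illustrative assumptions:

```python
END_TOKEN = "zzzend"  # any marker that never appears in real headlines

def add_end_token(headlines):
    """Append the end token so the generator learns where headlines stop."""
    return [h.strip().lower() + " " + END_TOKEN for h in headlines]

def generate(seed_text, next_word, max_words=20):
    """next_word is a hypothetical function that samples the model's next word."""
    words = seed_text.split()
    for _ in range(max_words):
        word = next_word(" ".join(words))
        if word == END_TOKEN:  # natural cutoff: stop at the end token
            break
        words.append(word)
    return " ".join(words)
```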

The other huge issue with the first generator was that it just didn’t have enough training data. This caused overfitting, repetition, and generally boring or samey headlines even when the diversity value increased dramatically. So, I decided to use 50,000 headlines as training data for the generator as opposed to 8,600, which is a pretty significant increase. In fact, this increase was so significant that when all the headlines had been split up into sequences (a total of 465,240 sequences) and fed into the model, there were over 5 million parameters and the fitting time for the model was over 24 hours…
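For context, splitting headlines into training sequences means turning each headline into every possible word prefix, which is how 50,000 headlines balloon into 465,240 sequences. Here’s a sketch of that standard Keras-style preprocessing (the exact details of my pipeline may differ):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def build_sequences(headlines):
    """Split each headline into growing n-gram prefixes for next-word training."""
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(headlines)
    sequences = []
    for line in headlines:
        tokens = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(tokens)):
            sequences.append(tokens[: i + 1])  # prefix of length i + 1
    max_len = max(len(s) for s in sequences)
    padded = pad_sequences(sequences, maxlen=max_len, padding="pre")
    X, y = padded[:, :-1], padded[:, -1]       # last word in each prefix is the label
    return X, y, tokenizer, max_len
```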

No From Me

In order to decrease the model fit time, I randomly shuffled the headlines and dropped the latter half, leaving me with 25,000 headlines, closer to 2.5 million parameters, and a fit time of 8 hours. That was still a long time, but far more palatable than 24 hours and doable overnight. Unfortunately, improving and re-running my “GAN”, which was supposed to be a one- to two-day endeavor, ended up taking around a week. I ended up creating multiple models, each taking 8 hours to train, until finally settling on a new architecture that included two smaller LSTM layers and two dense layers (you can read more about this process on my website).
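In Keras, one plausible reading of “two smaller LSTM layers and two dense layers” is something like the following; the layer sizes and embedding dimension are my assumptions rather than the exact architecture:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

def build_generator(vocab_size, max_len, embed_dim=100):
    model = Sequential([
        Embedding(vocab_size, embed_dim, input_length=max_len - 1),
        LSTM(128, return_sequences=True),         # first, smaller LSTM layer
        LSTM(128),                                # second LSTM layer
        Dense(128, activation="relu"),            # first dense layer
        Dense(vocab_size, activation="softmax"),  # second dense layer: next-word probabilities
    ])
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
    return model
```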

I then took another look at exactly how I was generating my fake headline dataset. Before, I would generate five headlines with the same seed text, each at a different diversity from 1.2 through 1.6, and then five more with that same seed text at diversities 2.1 through 2.5. The problem is that those ten headlines all share the same seed text, so out of 1,000 headlines there were only 100 unique starting words (out of a total vocabulary of 11,265). Now, when generating my 50,000 fake headlines, I generated them in groups of 5 that share a single diversity value (now a float between 1.2 and 2.9), each headline with its own random seed text. I generated these headlines overnight so that I could build my classifier in the morning.
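For anyone unfamiliar with the “diversity” knob: it works as a sampling temperature, reshaping the model’s next-word probabilities before a word is drawn (this is the standard text-generation recipe; the exact sampling code below, including the function name, is a sketch rather than my literal implementation):

```python
import numpy as np

def sample_word_index(preds, diversity=1.0):
    """Re-weight next-word probabilities by the diversity (temperature), then sample."""
    preds = np.asarray(preds, dtype="float64")
    preds = np.log(preds + 1e-10) / diversity  # higher diversity flattens the distribution
    probs = np.exp(preds)
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)
```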

It took me a couple of iterations to get the classifier working at a level I wanted, but after changing the architecture to more closely resemble the generator, the classifier started working pretty well, with a validation accuracy of around 88% and a false-positive rate of only 5% on the test data. However, the real test would be to run the classifier against newly generated headlines that weren’t in the training or test data and see whether the false-positive rate came in below the 75% from last time. So, I loaded my new generator and my new classifier into the “GAN” simulation, generated 100 fake headlines (in groups of 5 sharing a diversity value, each with different seed text), and recorded the number of false positives. Thankfully, the experiment ran pretty successfully! The lowest false-positive rate I encountered was 6%, which is really close to the 5% from the test data, and the highest was around 23-25%, which is an incredible improvement over 75%! I decided that these stats were good enough to run my experiment on, so I tested generated headlines against the classifier over and over again (more than 750,000 fake headlines in total).
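Since I mentioned reshaping the classifier to more closely resemble the generator, here’s a rough sketch of what that might look like; the layer sizes are assumptions, with the main difference from the generator being the final sigmoid layer that outputs a single real-vs-fake probability:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

def build_classifier(vocab_size, max_len, embed_dim=100):
    model = Sequential([
        Embedding(vocab_size, embed_dim, input_length=max_len),
        LSTM(128, return_sequences=True),
        LSTM(128),
        Dense(128, activation="relu"),
        Dense(1, activation="sigmoid"),  # probability that the headline is real
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model
```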

I wrote a simple method to take the average of the diversities with the greatest numbers of false positives (a rough sketch of it follows the headline list below). The more false-positive scores you factor into that average, the higher the average diversity rises; in other words, the diversities that fool the classifier most often sit toward the low end of the range. This leads me to believe that a higher diversity is actually detrimental, as it corresponds to a lower average number of false positives. I then generated some headlines with diversities between .5 and .9, since they produced a greater average number of false positives. Here are a few of those headlines:

  • fenway attacked ezell to rule islamic refugees in japan
  • lafleur police attack kills dozens in france
  • macys is right about iran
  • typo bans adults instagram following ambush
  • deranged teens dzhokhar tsarnaev arrested for parole
  • yahoos challenge why hes going to change north carolina
  • blast police assaulted at lax injuring jihad
  • whiskey driver gunned down in minnesota ihop
  • margaritaville driver continues into saudi border
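And here’s the kind of averaging method I described above, as a minimal sketch (the top_k cutoff and the exact data layout are my assumptions):

```python
def best_diversity(results, top_k=10):
    """results: (diversity, false_positive_count) pairs from many trials.
    Rank by false positives and average the diversities of the top k."""
    ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
    top = ranked[:top_k]
    return sum(d for d, _ in top) / len(top)
```

Raising top_k is what pulls the average diversity upward, which is the effect described above.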

I’m working on finishing up Cathy O’Neil’s book, “Weapons of Math Destruction,” which is a fascinating and informative read that I’m excited to share with you in my next update. Building these models has been an incredible learning experience, and although my time working on this research for the Monroe project is coming to an end, I definitely plan on continuing. I’m captivated by machine learning, natural language processing, and the ways our biases can play into the algorithms we use every day. I’ll also be adding the full code for my generator, my classifier, and my “GAN” simulation to my GitHub, after I’ve cleaned it up a bit.