Introduction to the data
In my NLP class’s final project, I chose a dataset of Amazon video game text reviews coupled with their star ratings. I’m an avid PC gamer, and you can catch me playing a lot of simulation racing games nowadays, so I wanted to explore the language used to rate video games in Amazon reviews. The dataset included physical games, consoles, and gaming peripherals. The goal of the project is to classify each text review into its associated star rating. If that works, we can accurately identify well-received games and understand why by looking at the language used to describe the products.
This dataset posed several challenges. My main concerns were data quality, volume (~1.5 GB), and how to handle missing data. The dataset included products only loosely categorized as gaming related. It also included problematic data points: rows with missing fields, reviews without text, and titles mangled by scraping anomalies.
Starting with the data quality issues, this labeled dataset fortunately had a column for the main category of each product. Even so, something categorized as Toys & Games could be a board game that ended up loosely labeled as Video Games. Another example is a steering wheel titled “90–93 Toyota MR2 NRG 350MM Steering Wheel + Hub + Quick.” The title was probably truncated and likely ends in “Quick Release.”
This is a 90–93 Toyota SW20 MR2.
And this is a simulation racing rig.
Close, but not quite the same thing. Cull them!
Next, some rows had no text or no ratings, leaving nothing to analyze or label, respectively. Another easy candidate for removal.
Finally, some titles had scraping errors that left them filled with HTML rather than a product title. All of these were programmatically removed alongside the previously noted changes to produce a more manageable, cleaner working dataset.
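A minimal cleaning sketch with pandas. The column names (`title`, `reviewText`, `overall`) are assumptions standing in for the real schema, and the tiny frame is made up for illustration:

```python
import pandas as pd

# Toy frame; column names are hypothetical stand-ins for the real schema.
df = pd.DataFrame({
    "title": ["Great racing wheel", "<span class='a-size-base'>", "Board game"],
    "reviewText": ["Loved it", "ok", None],
    "overall": [5.0, 4.0, None],
})

# Drop rows with missing review text or missing star ratings.
clean = df.dropna(subset=["reviewText", "overall"])

# Drop titles that look like leftover HTML from scraping.
clean = clean[~clean["title"].str.contains("<", regex=False)]
```

Only the first row survives both filters; on the real data the same two steps would knock out the unlabeled and HTML-titled rows in one pass.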
Moving along into data exploration, my initial goal was to see how the labels are distributed and how that might affect my subsequent text embedding and modeling choices.
Figure 1: Word Counts Histogram
Figure 1 shows the large majority of reviews have between 0 and 50 words. The pandas describe function shows the lengthiest, most passionate review had 32,721 words. For text analysis, this immediately clues me in that very short and very long reviews may not be that useful.
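The word-count summary can be reproduced with one line of pandas string methods (the two sample reviews here are invented for illustration):

```python
import pandas as pd

# Two made-up reviews standing in for the real text column.
reviews = pd.Series([
    "Short review",
    "This wheel is fantastic for sim racing and feels great",
])

# Whitespace tokenization is a rough but fast word count.
word_counts = reviews.str.split().str.len()
print(word_counts.describe())  # count, mean, min/max, quartiles at a glance
```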
Figure 2: Ratings Frequency
To my surprise (but not really), nearly 70% of reviews fell into the 4 or 5 star categories. This is going to be problematic because any model can achieve a 50% accuracy by simply predicting a 5 star rating for every review, and it’s still better than a random guess (20% accuracy).
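A toy majority-class baseline makes the problem concrete. The counts below are made up to mirror the observed skew (~50% 5-star, ~20% 4-star), not the real distribution:

```python
from collections import Counter

# Invented label counts that roughly mirror the dataset's skew.
labels = [5] * 50 + [4] * 20 + [3] * 12 + [2] * 8 + [1] * 10

# The degenerate "always predict the most common class" model.
majority = Counter(labels).most_common(1)[0][0]
baseline_acc = sum(l == majority for l in labels) / len(labels)
print(majority, baseline_acc)  # predicting all 5s already hits 50% accuracy
```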
Figure 3: Sample of categorizations with main category, “Software”
As mentioned in my challenges section, some products were not accurately categorized, and Figure 3 shows a sample of that. These are video games in some sense, but “Software” is a very broad category. Further down the list and not shown in the figure, a more problematic entry was “Adobe Photodeluxe 2.0,” which is surely not a game. Short of manually going through the 2.5 million entries, there will be some noise in my final dataset.
In the end, I filtered out reviews with fewer than 100 words and left in all the long reviews, figuring that stop word removal and stemming would strip the irrelevant material. I also removed rows with no ratings (no labels) and rows with no text at all, ending with a total of 1.64 million reviews.
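The length filter is a one-liner once a word count exists. As before, `reviewText` is a hypothetical column name and the frame is a toy:

```python
import pandas as pd

# One short and one long made-up review.
df = pd.DataFrame({"reviewText": ["too short to keep", "word " * 120]})

# Count whitespace-separated tokens, then keep only the longer reviews.
df["n_words"] = df["reviewText"].str.split().str.len()
kept = df[df["n_words"] >= 100]
```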
I first parsed the text by removing punctuation, words with fewer than two letters, and words with more than 21 letters. This is an arbitrary way to keep words commonly known to most people, giving the embeddings an easier basis for landing close to one another. Next, I removed common stop words via NLTK’s English stop word list, then lowercased every token to further simplify analysis. Finally, I turned on stemming via NLTK’s PorterStemmer. These choices were made mostly for processing speed at the time and for a first prototyping pass at text models.
I tried both TF-IDF and GloVe embeddings, the latter with both 100 and 200 dimensions. I decided to forgo BERT, Word2Vec, and fastText, again in the interest of time. I had used GloVe before, so I had code ready to transfer to this situation. In hindsight, I should have defaulted to Word2Vec because I was using unigrams, not bigrams.
The default choices for text modeling at the time were recurrent neural networks. I first used TF-IDF embeddings to check whether a simpler approach could work better, then chose GloVe embeddings as the more complicated option. Both approaches used 100- and 200-dimension embeddings.
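For the TF-IDF side, scikit-learn's `TfidfVectorizer` is the usual tool. The two documents and the `max_features` cap are illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two tiny made-up "reviews".
docs = ["great game fun gameplay", "broken controller bad game"]

# max_features caps the vocabulary; 100 echoes the 100-dimension setting.
vec = TfidfVectorizer(max_features=100)
X = vec.fit_transform(docs)
print(X.shape)  # (number of documents, vocabulary size)
```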
I initially planned to start the modeling process with the simplest, fastest approach, random forests, then move toward the more complicated, slower neural networks if needed. I experimented with all four embedding combinations fed into default random forest models, 1–2 layer dense neural networks, 1–2 layer recurrent neural networks, and a 2 layer 1D convolutional neural network.
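A minimal Keras sketch of the single-layer recurrent variant, assuming hypothetical sizes (20k-token vocabulary, 100-d embeddings, 5 rating classes); the real models and hyperparameters were not published here:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical sizes: 20k-token vocab, 100-d embeddings, 5 star classes.
model = keras.Sequential([
    layers.Embedding(input_dim=20_000, output_dim=100),
    layers.LSTM(64),
    layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Swapping `layers.LSTM(64)` for `layers.Bidirectional(layers.LSTM(64))` gives the bidirectional variant mentioned later.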
For evaluation, I went with F1 score over plain accuracy or either precision or recall alone. As noted earlier, accuracy could be fairly high if the model simply output 4 and 5 correctly while guessing randomly among the rest: roughly 80% by my back of the envelope calculation, 50 (5 stars) + 20 (4 stars) + 0.33 × 30 (1, 2, or 3 stars). Optimizing for precision while ignoring false negatives is an oversight in this context because the majority of predictions could easily lie outside the true and false positives. Optimizing for recall is not necessarily desired either, because predicting a 4 that was actually a 5 is not that far off: you still get a positive review. So a balance between precision and recall is what I want here.
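A small sketch of why macro F1 punishes the degenerate baseline that accuracy lets off the hook (labels invented for illustration):

```python
from sklearn.metrics import accuracy_score, f1_score

# Eight made-up ratings, skewed toward 4s and 5s.
y_true = [5, 5, 4, 3, 1, 5, 4, 2]
y_pred = [5] * 8  # the degenerate "always predict 5 stars" model

print(accuracy_score(y_true, y_pred))             # looks passable
print(f1_score(y_true, y_pred, average="macro"))  # exposes the collapse
```

Macro averaging scores each star class equally, so the four classes the model never predicts drag the score toward zero.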
I decided to forgo tuning the random forest models because they overfit the training data severely and produced very poor test F1 scores of about 0.04. Next came dense neural networks, which still overfit my training data but improved the test F1 scores to about 0.330. The 1–2 layer dense neural networks reported similar test F1 scores with both TF-IDF and GloVe embeddings but did not overfit with TF-IDF.
The next set of experiments used regular and bidirectional LSTM recurrent neural networks. These models generalized much better and showed little improvement when adding layers, topping out at a test F1 score of about 0.410. In these cases, the GloVe embeddings performed much better but still overfit.
The 1D CNN experiment trained very quickly and did not overfit as much as the RNNs but reported a much poorer test F1 score of 0.278. Finally, I tried one ensemble model that combined my best LSTM, dense NN, CNN, and bidirectional LSTM models weighted heavily towards the LSTM models to achieve a slightly improved F1 score of 0.417.
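The weighted ensemble amounts to averaging each model's class probabilities with larger weights on the LSTMs. The probabilities and weights below are invented for illustration:

```python
import numpy as np

# Hypothetical per-model class probabilities for one review (5 classes).
lstm   = np.array([0.05, 0.05, 0.10, 0.30, 0.50])
bilstm = np.array([0.05, 0.05, 0.15, 0.35, 0.40])
dense  = np.array([0.10, 0.10, 0.30, 0.30, 0.20])
cnn    = np.array([0.10, 0.20, 0.30, 0.20, 0.20])

# Weight the recurrent models more heavily, as the ensemble did.
weights = np.array([0.35, 0.35, 0.15, 0.15])
blend = np.average([lstm, bilstm, dense, cnn], axis=0, weights=weights)
print(blend.argmax() + 1)  # predicted star rating
```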
These F1 scores are not great, and I would definitely not say this model is ready for use.
What I’d Do Differently
While I was under a time crunch, I certainly made several assumptions and committed various oversights along the way. If I were to expand on this project, I’d start by exploring the following:
- Bin 1–3 star reviews together, and separate 4 and 5 star reviews.
- Sample each label more equally by downsampling 4 and 5 star reviews. I’d propose this to give the model a more well-rounded view of poorer reviews.
- Do more data exploration and consider dropping poor reviews along the lines of “did not receive my product,” since these reflect no fault of the product itself but are commonly written as 1 star reviews. There were also so many different types of games in the dataset that it was never going to go well.
- Consider a clustering analysis first; I believe the level of abstraction in this dataset is far too broad at the moment. Combining consoles, video game software, and all video game genres was a huge oversight on my part. Different genres have their own language, and reviewing consoles definitely calls for a different vocabulary than reviewing games.
- After clustering, experiment with different models on each cluster. Different genres each have their own language, and a more specialized model could do better even when trained on less data.
- Learn and implement BERT over GloVe. Video games definitely have their own lexicon that may not, and I’d guess absolutely does not, align with how news articles are written.
- Redo text parsing: punctuation matters. Uncommon words could have been eliminated because they do not appear much in news articles. This is another reason to use dataset-specific embeddings over a pretrained set.
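The downsampling idea from the list above can be sketched in pandas. The `rating` column and toy counts are hypothetical:

```python
import pandas as pd

# Toy skewed label column; real counts are far larger.
df = pd.DataFrame({"rating": [5] * 50 + [4] * 20 + [3] * 12 + [2] * 8 + [1] * 10})

# Downsample every class to the size of the rarest one.
n = df["rating"].value_counts().min()
balanced = df.groupby("rating").sample(n=n, random_state=0)
print(balanced["rating"].value_counts().to_dict())
```

Equal class counts trade away data volume for a model that actually sees poor reviews, which is the bet the bullet above proposes.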
GitHub Repo: https://github.com/SpecCRA/453_final_assignment
Python Libraries used:
- TensorFlow 2.0, Keras