For as much as machine learning (ML) is a science, you might be surprised how much it can often look like an art. You can’t just drop data in at the beginning of the funnel and expect your computer to work its magic! Data often needs to be massaged and coaxed so that it is finely tuned to work well with whatever ML algorithm you’ve selected. It’s a fine dance that requires you to be intimately familiar with the content of your data. One could even argue that subject matter expertise is more important than having the machine learning skill set itself.
As somebody who is still very much on a learning path toward a data science role, trust me when I say that I’ve run into my fair share of model performance issues. Actually, I’m pretty sure I’ve run into almost every issue on this list! It can be really disheartening when you pull together what you think is a masterfully crafted model only for it to give you 50% accuracy. I’ve got a coin in my pocket that will do a better job than that!
One thing before jumping in… this post is intentionally written both for folks newer to the machine learning space and for people who have zero experience with machine learning and want to learn more. That said, I might at times cover some concepts you’re already familiar with. I’ll do my best to strike a good balance for both parties. Fair enough?
Okay, we’ve got a lot to chug through in this post, so let’s get into it!
1. Your dataset has some pretty big outliers or widely varying ranges between features.
The way I help people demystify ML and artificial intelligence as a whole is by telling them that it’s all essentially some pretty fancy math done by your computer. I don’t mean that at all in a derogatory way, but let’s face it: we have to call a spade a spade. So when you have an ML algorithm looking at one feature with a 0 to 10 range and another with a -1,000 to 2,000 range, it’s not going to treat them on an equal playing field! Throw in a bunch of other dimensions, and you will likely have wildly differing ranges and distributions across your dataset.
If you do nothing about this, many ML algorithms (especially ones that rely on distances between points or on gradient descent) will give undue favor to the dimensions with the biggest ranges, and we call this introducing bias. (Remember that because you’ll be seeing more of that in this post.) Just like a parent who gives one kid 50 presents and the other 3 kids a single gift each and then tries to tell you they’re not biased… well, actions speak louder than words, my friend!
There are many ways to deal with this, and perhaps the two most popular are standardizing and normalizing. Standardizing subtracts the column’s mean from each observation and divides by the column’s standard deviation, so each new value tells you how many standard deviations that observation sits from the mean. Normalizing (also called min-max scaling) squeezes the data from its current range to somewhere between 0 and 1. Keep in mind that normalizing is sensitive to outliers, since a single extreme value stretches the range for everything else. There are uses for both, and packages like Scikit-Learn make them very easy for us to implement.
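To make the math behind those two transforms a little less mysterious, here is a minimal sketch in plain Python. The two feature columns are made-up numbers for illustration, and the function names are my own. In practice you would more likely reach for Scikit-Learn’s `StandardScaler` and `MinMaxScaler`, which apply the same idea column by column.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    # Note: a constant column has sigma == 0 and would need special handling.
    return [(v - mu) / sigma for v in values]

def normalize(values):
    """Min-max scale values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    # Note: a constant column has hi == lo and would need special handling.
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical feature columns with very different ranges
small_range = [0, 2, 5, 7, 10]
large_range = [-1000, -250, 500, 1250, 2000]

small_std = standardize(small_range)
large_std = standardize(large_range)
small_norm = normalize(small_range)
large_norm = normalize(large_range)

print(large_norm)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

After either transform, both columns live on a comparable scale, so neither one dominates simply because its raw numbers are bigger.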
2. Your dataset contains a target class imbalance.
Okay, true story, this is the one that has personally plagued me the most, which is silly considering how simple it is to diagnose! In ML algorithms where we want to classify something based on historical data, we feed in that historically labeled data to train what is called a supervised learning algorithm. So if you want to predict which flavors of ice cream people buy at certain times of the year, you’ll feed the ML algorithm (pun intended!) historical sales data from previous years.
Remember our old friend bias? Well, your historical data might contain some anomalous information that you don’t want adversely messing with your predictive results. For example, let’s say you — an ice cream shop owner — develop a new flavor called “Beam Me Up, Peanut Butter Cup” in honor of the “Storm Area 51” event that recently took place. Your sales skyrocketed in the month of September 2019 as everybody flocked to try this new flavor.
If you dumped this data in that basically says “Everybody loves Beam Me Up, Peanut Butter Cup in the month of September,” well… guess what’s going to happen in September 2020? I’m guessing this “Storm Area 51” event isn’t going to happen again next year, so if you rely on your ML algorithm and stock up on Beam Me Up, Peanut Butter Cup… you’re probably going to be sorely disappointed.
The way to get around this is by resampling your data properly. In this case, you’ll probably want to downsample the anomalous sales data of Beam Me Up, Peanut Butter Cup in September 2019. (Why did I have to create such an obnoxiously long name for that flavor…?) In other cases, you might need to oversample some underrepresented classes. One popular technique for doing this is called Synthetic Minority Over-sampling Technique, or SMOTE for short. I’m not going to get into that here, but you can learn more about that at this link.
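For a feel of what oversampling looks like, here is a rough sketch of the simplest version: plain random oversampling, where rows from the smaller class get duplicated until the classes are even. To be clear, this is not SMOTE (which synthesizes new points instead of copying existing ones), and the sales records and the `random_oversample` helper are hypothetical names I made up for illustration.

```python
import random
from collections import Counter

def random_oversample(rows, label_of, seed=0):
    """Duplicate random rows from smaller classes until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the target size
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Hypothetical sales records: (month, flavor), heavily skewed toward vanilla
sales = [("Sep", "vanilla")] * 90 + [("Sep", "mint")] * 10

balanced = random_oversample(sales, label_of=lambda row: row[1])
counts = Counter(flavor for _, flavor in balanced)
print(counts["vanilla"], counts["mint"])  # 90 90
```

One thing to watch for: resample only your training data. If duplicated rows leak into the data you evaluate on, your metrics will look rosier than they should.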
3. You’re measuring your results with a less than ideal metric.
Earlier in the post, I mentioned accuracy in passing, but that is just one metric by which we judge how well an ML model is performing. And truthfully, that metric is probably the least helpful when taken by itself.
I’m not going to cover the full gamut of metrics we use in the ML world, but two of the most popular are precision and recall. Precision asks: of everything the model flagged as positive, how much actually was positive? Maximizing it means minimizing false positives, even if the model misses some true positives along the way. Recall asks the opposite question: of everything that actually was positive, how much did the model catch? Maximizing it means minimizing false negatives, even if a few false positives sneak in.
So if you’re trying to decide whether to send a flyer via mail to a potential customer but it’s really expensive to do so, you might seek to maximize precision so that you only pay to reach people who are likely to respond. On the other hand, if you’re just sending them an email that costs next to nothing, you might maximize recall instead so that you don’t miss anybody who might be interested.
(There are even metrics to help balance these called the F1 and F-Beta metrics, but we won’t cover those here.)
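If it helps to see the arithmetic, here is a small sketch that computes these metrics by hand from counts of true positives (TP), false positives (FP), and false negatives (FN). Precision is TP / (TP + FP), recall is TP / (TP + FN), and F-Beta blends the two, leaning toward recall as beta grows. The labels and predictions are made-up numbers, and the function name is my own.

```python
def precision_recall_fbeta(y_true, y_pred, beta=1.0):
    """Compute precision, recall, and F-beta for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta

# Hypothetical ground truth and model predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_fbeta(y_true, y_pred)
_, _, f2 = precision_recall_fbeta(y_true, y_pred, beta=2.0)

print(accuracy)   # 0.7
print(recall)     # 0.5
print(round(precision, 3), round(f1, 3), round(f2, 3))
```

Notice that the model scores 70% accuracy while catching only half of the actual positives, which is exactly why accuracy by itself can be misleading.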
Many of the model evaluation and tuning tools in the Scikit-Learn package, such as cross-validation and grid search, let you select which metric you want to target through a scoring parameter, so be sure to read up on the documentation to find out how to adjust it. If you don’t adjust it, it’ll rely on whatever default setting it has (for classifiers, that is typically accuracy), which might not be the best for you!
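As one example of picking the metric yourself, Scikit-Learn’s `cross_val_score` accepts a `scoring` argument. The sketch below scores the same model four different ways. The dataset is synthetic (generated with `make_classification`, with roughly 10% positives), and the choice of logistic regression is just an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced dataset: roughly 90% negatives, 10% positives
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

model = LogisticRegression(max_iter=1000)

# Score the same model under different metrics using 5-fold cross-validation
scores = {}
for metric in ("accuracy", "precision", "recall", "f1"):
    scores[metric] = cross_val_score(model, X, y, cv=5, scoring=metric).mean()

for metric, value in scores.items():
    print(f"{metric}: {value:.3f}")
```

On imbalanced data like this, the numbers can tell very different stories depending on which metric you ask for, so it’s worth choosing one on purpose.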
4. Your model is underfitting or overfitting to your training data.
Last week, I proudly passed my AWS Certified Machine Learning Specialty exam (woohoo!), but the struggle I ran into was that the test is so new that there are hardly any practice exams out there today. I took this one practice exam over and over and over and found myself with an odd problem… I was memorizing the answers! This proved to be unhelpful because memorizing “A, A, B, D, C, B, D…” instead of learning the general concepts is helpful for nobody. (And this is why I have a hard time recommending you study for that exam until some additional materials come out. That test was BRUTAL.)
That situation there is a classic example of overfitting, and that, along with its cousin underfitting, can prove maddeningly frustrating to deal with in an ML context. Ideally, we want to strike a balance where our ML model is learning general patterns in the data without simply memorizing the dataset in its entirety. Remember: we don’t want to memorize the multiple choice answers on a test; we want to learn the concepts that will help us pass a test with a different sequence of multiple choice answers.
Frankly, this can be one of the most difficult problems for a data scientist to diagnose and rectify. If you think your model is overfitting, you can stop training sooner, feed it more data, or reduce the dimensionality of your data with something like Principal Component Analysis (PCA). Conversely, if you think your model is underfitting, you might need a more flexible model, more informative features, or more training time. Those are some simple things to consider, but of all the issues we’ve examined so far in this post… I think this one takes the cake for the one that frustrates me the most.
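One common way to spot these problems is to compare a model’s score on the data it trained on against its score on data it has never seen. Here is a rough sketch using decision trees of different depths on a synthetic, deliberately noisy dataset. The dataset, the depths, and the choice of a decision tree are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset with about 10% of labels flipped to simulate noise
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# depth 1 is a very simple model; None lets the tree grow until it
# has memorized the training set
results = {}
for depth in (1, 4, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))

for depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A big gap between the training and test scores (like the unlimited-depth tree, which aces the training set) points toward overfitting. Low scores on both points toward underfitting.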
5. You don’t actually have data that represents a predictable pattern.
Oooo… this one can feel like a punch to the gut, but it’s something we certainly need to acknowledge. Machine learning models are NOT magical! Like I said earlier, they’re basically executing some very fancy math on your data. So if you’re expecting to drop in some data and have the computer magically tell you it can reliably predict a given event, you’re going to be sorely disappointed. (Funnily enough, basic metrics might tell you it can reliably predict an event, but dropping it in a real-world context will quickly show you those metrics had no idea what they were talking about.)
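A quick sanity check for this situation is to compare your model against a baseline that doesn’t look at the features at all. The sketch below builds a dataset where the labels are literal coin flips with no relationship to the features, then compares a logistic regression against Scikit-Learn’s `DummyClassifier`. The data is random noise I generated for illustration, not a real dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))     # features: pure noise
y = rng.integers(0, 2, size=1000)  # labels: coin flips unrelated to X

# Baseline that always predicts the most common class and ignores X
baseline_acc = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, cv=5
).mean()

# A real model trained on the same (patternless) data
model_acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"model accuracy:    {model_acc:.3f}")
```

If your model can’t meaningfully beat a baseline like this on held-out data, that’s a strong hint that the pattern you’re hoping for may not be in the data.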
This comes back to the whole science vs. art thing I posed at the top of this post. Machine learning can be a difficult thing to learn, but at the end of the day, it’s just another tool in your toolbox. If you think you’re going to carve a statue with a hammer and no chisel, you’re in for a bumpy ride. Likewise, these ML algorithms are constrained by how you use them. Dropping in data about how often kitties meow isn’t going to predict how often I drink Mountain Dew. (Although wouldn’t that be hilarious if it did? 😂)
Well friends, we made it to the end of another post! I actually had a few other ideas of things to add to this list, but it already got crazy long as is. I’ve given you enough to think about for one day! For my ML practitioner friends, what might you add to this list? See you all in the next post!