Hey there folks! As I’ve now been a practitioner in the data science space for a couple of years, I thought I’d start documenting some of the lesser known things I’ve come across that have been a huge help in my professional journey. These posts are going to get into the weeds on smaller topics that you might not come across in a traditional classroom setting. I don’t at all blame classroom settings for not covering all these things, as it would be too overwhelming for a beginning student to learn all at once.
These posts are going to cover topics that range from intermediate to advanced tips, but I’m intentionally going to target them at a level that anybody could understand if they desired. That said, I’m going to start off this particular post by covering what one hot encoding is before delving into our tip on how we can simply reverse it. If you already know what one hot encoding is, feel free to skip on down past this next section!
What is one hot encoding, and why is it necessary?
I don’t know about you, but I’m the kind of person who learns best using concrete examples. We’ll start off here with a simple glance at categorical data. I’m going to quickly generate a Pandas DataFrame that contains a single feature — “animals” — with a number of different animal types as values. (You’ll notice a few repeats in here, and that is intentional.)
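The original post showed this step as a screenshot, so here’s a minimal sketch of that setup. The specific animal values are my own guess (the originals aren’t shown); the point is just a single text column with a couple of intentional repeats:

```python
import pandas as pd

# A single categorical feature, "animal", with a few intentional repeats.
# (These exact values are illustrative, not the post's original data.)
df = pd.DataFrame({
    "animal": ["dog", "cat", "monkey", "elephant", "bird", "bird", "cat"]
})
print(df)
```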
Now, I’m not going to spend too much time teaching about machine learning in this post, so I’ll do my best to summarize it quickly here. In a nutshell, machine learning (ML) basically uses a bunch of fancy math to draw generalized patterns out from your data, and as you might expect, math needs to work with numbers. This doesn’t bode well for our “animal” feature here, since it is using text-based categorical data. If we tried to feed this feature as is to an ML algorithm, it would fail to run.
So how do data scientists resolve this?
There are a couple ways. You might be asking, “Why can’t we arbitrarily assign a number to each type of category seen in the feature?” That would introduce an undesired bias because the numbers would produce a hierarchy. Just for fun, let’s see what happens when we attempt to do that.
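A quick sketch of that attempt, assuming the same toy “animal” column. The number assigned to each animal is completely arbitrary here (I picked values by hand), which is exactly the problem:

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["dog", "cat", "monkey", "elephant", "bird", "bird", "cat"]
})

# Arbitrarily assign an integer to each category.
# (These exact numbers are made up; any assignment has the same issue.)
mapping = {"cat": 1, "dog": 2, "elephant": 3, "monkey": 4, "bird": 5}
df["animal_num"] = df["animal"].map(mapping)
print(df)
```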
Good news: our animal data is now in numerical form. But… is that really good news? You see, birds now have an assigned value of 5 whereas cats have an assigned value of 1. Does that mean birds are 5x better than cats? Or are monkeys lesser than elephants? The sort of method we just applied only works well with ordinal data. Ordinal data has an inherent order to it, where it makes sense that one number might be higher than another. For example, thinking about the different sizes of coffee you can buy at your local coffeeshop, it would make sense that a “tall” coffee has a lower number than a “venti” coffee.
Unfortunately, this option doesn’t work well for our animal data since there isn’t any inherent order between our animal types. We need another option!
That’s where one hot encoding comes in. One hot encoding takes categorical data and spreads the categories out across their own respective columns, and each observation in a new column only lights up as 0 or 1 based on the value in the original column. Bleh… I know that’s a mouthful. Learning one hot encoding is MUCH easier when seen in an example. In the example below, we’ll make use of Pandas’ “get_dummies” method to produce one hot encoded data.
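Here’s roughly what that looks like, again using my illustrative animal values in place of the original screenshot (the `dtype=int` argument just keeps the output as 0/1 instead of True/False):

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["dog", "cat", "monkey", "elephant", "bird", "bird", "cat"]
})

# get_dummies creates one column per category, with a 1 wherever
# that animal appears in the row and a 0 everywhere else
one_hot = pd.get_dummies(df["animal"], dtype=int)
df = df.join(one_hot)
print(df)
```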
Voila! It’s as simple as that. Once we drop the original “animal” feature, we now have a dataset that an ML algorithm can nicely work with. Did you also notice how one hot encoding handled animals that showed up more than once in the original feature? In the new “bird” feature, you can see that it simply returned a 1 for each time bird showed up and 0 for everything else. Nifty!
How can I reverse the process of one hot encoding?
In a data science classroom setting, you almost assuredly will be taught about one hot encoding as we have covered above, but what’s far less likely is that they will teach you how to reverse the process. There are a few reasons why not. In our simple example above, we don’t really need to figure out how to do this because our original feature is still intact. The work is already done!
Still, there is a small set of use cases where reversing one hot encoding is necessary, and I wouldn’t be writing this post if I hadn’t come across one myself. I recently worked on a project that took a large dataset, parceled out different parts of the dataset based on different conditions, and passed them to their own respective ML models. After it was all done, I needed a quick way to be able to re-combine the results into a single, new feature that basically said, “Yup, it went through this particular model.” In my case, there was no original data to draw from.
The Internet suggested a number of different ways to do this. All of them worked, but some were more computationally intense than others. For example, you could use a “for-if” loop to iterate through everything, but that would be highly computationally intense and frankly a PITA to write. Thanks to my good friends on Stack Overflow, I found a MUCH simpler way of going about this.
First, let’s carve off our original “animal” feature just to show we aren’t cheating here.
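A sketch of that carve-off, rebuilding the same toy data from scratch so all that’s left are the one hot encoded columns:

```python
import pandas as pd

# Recreate the one hot encoded data, keeping only the 0/1 columns
# so there's no original "animal" feature to cheat from
df = pd.DataFrame({
    "animal": ["dog", "cat", "monkey", "elephant", "bird", "bird", "cat"]
})
ohe_df = pd.get_dummies(df["animal"], dtype=int)
print(ohe_df)
```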
Now what we’re about to do here is so fast that if you blink, you might miss it. We’re going to make use of a super handy method in Pandas called “idxmax.” Per the official documentation, it returns the “index of first occurrence of maximum over requested axis.” Again… that’s mumbo jumbo to me. This is another thing that’s much easier to see in practice.
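In practice it looks like this. With `axis=1`, `idxmax` returns, for each row, the name of the column holding that row’s maximum value, and since a one hot encoded row is all 0s except for a single 1, that’s exactly the original category. (The 0/1 data below is spelled out by hand so the block stands on its own.)

```python
import pandas as pd

# Hand-built one hot encoded columns, matching the toy example
ohe_df = pd.DataFrame({
    "bird":     [0, 0, 0, 0, 1, 1, 0],
    "cat":      [0, 1, 0, 0, 0, 0, 1],
    "dog":      [1, 0, 0, 0, 0, 0, 0],
    "elephant": [0, 0, 0, 1, 0, 0, 0],
    "monkey":   [0, 0, 1, 0, 0, 0, 0],
})

# For each row, idxmax(axis=1) returns the column name of the max value,
# i.e. whichever column holds the 1
animals = ohe_df.idxmax(axis=1)
print(animals)
```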
How simple is that?! Given how computationally efficient Pandas is, this is almost certainly faster than writing your own loop, and it’s a TON easier to write. It worked out here that this DataFrame only contained OHE columns, but you can very simply point idxmax at a specific subset of columns.
And that wraps up this first data science quick tip. I told you it’d be quick! Truth be told, the first time I attempted to learn this, I started to go down the “for loop” rabbit hole, so I wish I would have known this when I was first attempting to perform reverse one hot encoding. In fact, pretty much all posts going forward will be me writing to a former me that wishes I could have seen all these tips sooner!
UPDATE: I later realized it might be helpful if I included my working code alongside these posts. That said, I’ve created a new repository on GitHub that will house my code for each post in its own respective directory. For this particular post, I simply used a very basic Jupyter notebook, and you can find that at this link.