Where You Get the Biggest Bang for Your Buck in the World of Data
A few months ago, I completed a project for my Udacity Machine Learning Engineer nanodegree around using artificial intelligence to understand customer demographics and purchasing patterns and make recommendations for improving the Starbucks Rewards program. (You can check out that post here.) The project began with three JSON files provided by Udacity / Starbucks containing roughly 25 megabytes of structured data. For those of you trying to wrap your minds around what that number means in the structured data world, it’s a fair amount of data. Not too small, not too big.
Now, if you’re not familiar with the whole data science / artificial intelligence world, I would not blame you at all for thinking that I had to do a ton of work to apply the proper mathematical algorithms to create these clusters of consumer data. In ye days of olde, that probably would have been true. But with the help of excellent data science packages like Scikit-Learn, people have encapsulated these AI algorithms behind code in some very convenient ways.
So, would you like to see all the code I wrote to actually build my clustering algorithm in the aforementioned Starbucks project?
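It boiled down to something like this (KMeans and its settings here are a representative stand-in; the exact scikit-learn model and parameters live in my notebook):

```python
# A representative two-liner: instantiate a scikit-learn clusterer, then fit it.
# KMeans, n_clusters=5, and the random feature matrix are stand-ins for illustration.
import numpy as np
from sklearn.cluster import KMeans

customer_features = np.random.rand(100, 8)  # placeholder for the cleaned feature matrix

model = KMeans(n_clusters=5, random_state=42)  # step 1: instantiate the model
model.fit(customer_features)                   # step 2: fit your data to it
```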
If you blinked, you might have missed it!
Yes, for this particular model — and many similar to it — you can create it in just two lines of code: one to instantiate the model, and another to actually fit your data to it. (I like to mess with people sometimes by telling them I can write a predictive model on a whiteboard in just 30 seconds. They’ve never taken me up on the offer, thinking I’m either joking or a total genius. I’d like to think I’m the latter, but the truth is I’m neither.)
Of course, there is a big catch to this whole thing: you can’t just dump in any ol’ dataset and expect the predictive model to make sense of it with no context. Before I was able to successfully cluster my data with that algorithm, I had to do a couple things to it. And when I say a couple things, I mean a LOT of things. (You can look at all my code in this Jupyter Notebook on GitHub.) From cleaning and coaxing to merging and filtering, it’s pretty evident from my code that the bulk of my time was spent getting the data into the form I needed it in! If this project were a marathon, cleaning the data would be the months of training; actually creating the predictive model would be the few hours it takes to run the race on race day.
(Actually, I’m really proud of that analogy because just like after running a marathon, you’re tired as heck and super happy that you don’t have to train anymore!)
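To give a flavor of that prep work, here’s an illustrative sketch of the kind of cleaning, coaxing, merging, and filtering involved. The file names, column names, and rules below are placeholders rather than the actual Starbucks/Udacity schema, and the real notebook does far more of this:

```python
import pandas as pd

# Illustrative only: file names, columns, and rules are placeholders, not the real schema.
profiles = pd.read_json("profiles.json", orient="records", lines=True)
transactions = pd.read_json("transactions.json", orient="records", lines=True)

# Clean: drop rows missing key demographics, coerce awkward types
profiles = profiles.dropna(subset=["age", "income"])
profiles["became_member_on"] = pd.to_datetime(
    profiles["became_member_on"].astype(str), format="%Y%m%d"
)

# Coax: one-hot encode categorical fields so the model can digest them
profiles = pd.get_dummies(profiles, columns=["gender"])

# Merge and filter: attach spending behavior to demographics, keep active customers
spend = transactions.groupby("customer_id")["amount"].sum().rename("total_spend").reset_index()
customer_features = profiles.merge(spend, on="customer_id", how="inner")
customer_features = customer_features[customer_features["total_spend"] > 0]
```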
Quick recap: three JSON files, roughly 25 megabytes of data. Cleaning this data took me several hours of writing code, whereas writing the code to create the model took about 30 seconds.
Now let’s talk scale.
I worked with megabytes of data, but it’s super common to work with gigabytes, terabytes, and even petabytes of data. The files were handed to me by Udacity on a silver platter… which is most definitely not the case for most companies. And if you think the way one person defines what a customer is in one dataset matches the definition of a customer in other datasets… good luck, my friend.
Given these constraints and restrictions, how long do you think it would take for a data scientist to even begin prototyping a predictive model? Days? Weeks?
What if I told you it could take as long as months?
Yes, folks, that is not a joke. It is not uncommon for a data scientist to take months to get a working predictive model out the door just because of how long it takes to get her hands on the data.
Think about it this way: data is a raw material, just as wood, nails, and pipes are raw materials for building a house. Wood and nails are combined to build support trusses, and support trusses are used to build the roof of a house. So it goes without saying that no wood + no nails = no roof. In the same way, missing or unusable data means no predictive model.
If you’re a startup, it’s tempting to start cobbling random datasets together to quickly push an ML model out the door into production. That might work fine in the short term, but think back to scalability. If you were building your own house, running to Lowe’s or Home Depot each time you needed a specific construction material might be okay. But if you’re looking to build a whole neighborhood of houses, running to the hardware store every time you need something becomes a wasteful nightmare.
Do you see where I’m going with this?
What is going to provide your company the most value in the world of data is actually getting your data in order. This means ensuring consistency between datasets, properly managing your metadata, establishing lifecycle rules on your datasets, and more. Just like you would inventory and organize your construction materials in a warehouse, so should you properly manage the data in your organization.
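What does “getting your data in order” actually look like day to day? As one small, hypothetical example (the function, field names, and rules here are mine, not any particular standard or tool), here’s a sketch of a check that two datasets agree on how a customer is keyed before anything gets built on top of them, plus the kind of minimal metadata record you might keep alongside a dataset:

```python
import pandas as pd

def check_customer_key(df_a, df_b, key="customer_id"):
    """Hypothetical consistency check: do two datasets agree on the customer key?"""
    issues = []
    for name, df in (("dataset_a", df_a), ("dataset_b", df_b)):
        if key not in df.columns:
            issues.append(f"{name} is missing the shared key '{key}'")
    if not issues and df_a[key].dtype != df_b[key].dtype:
        issues.append(f"'{key}' dtypes disagree: {df_a[key].dtype} vs {df_b[key].dtype}")
    return issues

# A minimal, made-up metadata record you might keep next to each dataset
metadata = {
    "name": "customer_profiles",
    "owner": "data-platform-team",
    "primary_key": "customer_id",
    "retention_days": 365,  # a simple lifecycle rule
}
```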
You might be groaning reading all of this, and I don’t really blame you. Frankly, it’s unsexy work. Organization in and of itself doesn’t provide a whole lot of visible value, so it can be difficult to see how the end product of your work will eventually manifest itself in a machine learning model. But the longer you put this off, the harder it is to get things in order down the road. Think about the houses you see on shows like Hoarders. I’m not disparaging those people at all, but it’s easy to see that cleaning up would be far less painful if the homes had been kept tidy in the first place.
I titled this post with money language for a reason. Money talks, and we closely associate value with the products at the end of the supply chain. In the data world, revenue is often closely tied to a few specific AI/ML predictive models. The business of organizing data sits upstream and therefore largely gets ignored. But what people sometimes fail to realize is that the upstream data source doesn’t just feed your current downstream model; it also enables future downstream models.
So while the big money might come from a general social networking algorithm, you might be able to use similar data to create a dating algorithm, effectively doubling sources of revenue. And over time as you add more and more predictive models fed from the same upstream data source, the ROI on properly organizing your data goes through the roof! (Why do you think Facebook and Google are so successful?)
I hope that gives you some food for thought if you work in the IT sector. I’ve heard it said that “data is the new bacon,” and I agree with that statement… so long as the data is in a readily usable form. In a raw, messy form, your data is almost totally useless. Say goodbye to those fancy ML predictive models you were dreaming up.
Let’s wrap it up there. If you appreciated this post, I’d invite you to check out a few other posts I’ve recently written on why your ML model is performing poorly, speaking the language of AI, and five questions to ask yourself before jumping into the AI world. Hope y’all enjoyed this post, and I’ll catch you in the next one!