The Lurking Danger of Citizen Data Science
Sometimes when I’m in a bit of a trickster mood, I like to mess with people by telling them I can write a predictive model in fewer than five lines of code. Technically speaking, I’m not lying. Data science and machine learning are built on often very complex mathematical algorithms, but the nitty-gritty details have been conveniently hidden away from us.
I’ll try not to get too into the weeds here, but given that you might be a reader who is interested but knows very little about this space, I think it’s worth talking about this idea of abstraction. If you asked me to define abstraction on a more general level, I’d say it’s the idea of giving you what you need to know to do something without having to understand the intricacies of how it works. For an everyday example, you probably know how to drive a car. Turn the key, shift into drive, accelerate with the gas pedal, steer with the steering wheel. Simple enough, right?
If you’re like me, you probably don’t actually know what the car is doing when you push on the gas pedal or turn the steering wheel. And frankly, you probably don’t care to know those details. I know I don’t! All I know is that if I push my foot down this way and turn my arms that way, I’ll handily drive to Taco Bell just fine and get my Mountain Dew Baja Blast (now in the Zero Sugar variety) with no muss, no fuss. It might as well be magic working under the hood for all I care.
Using the language of abstraction, we would say that the details of how a car works under the hood have been abstracted away from us. Car manufacturers have made it easy for average folk like you and me to operate a vehicle despite not knowing the details.
The same principle holds true for every coding language on a computer. At the lowest level, your computer’s processor is basically flickering back and forth between “0” and “1” billions of times a second. (Which is a little mind-boggling to think about!) Somehow, we humans figured out how to make sense of this flickering to do basically everything you see when you interact with a device. That’s how this blog post was written. That’s how you’re reading this blog post. That’s how we watch adorable cat videos on YouTube.
This is how coding languages have come into existence. It’s unfeasible to expect people to write these wonderfully complex computer programs just by making sense of the 1’s and 0’s emitted by a computer processor. As time has passed, very smart people have gotten better and better at writing coding languages that continue to abstract away the lower level workings of a computer. It used to be that you could really screw up your computer’s hardware if you made a silly mistake. Now, you really have to be trying to do such a thing. Phew!
Fast forward to 2020, and one of the most common practices is for people to create coding libraries that package up specific pieces of functionality for a specific coding language. The coding language most data science-y folks like me work with most often these days is Python, and Python has TONS of coding libraries for pretty much anything you can possibly desire. Want to work with lat/long data points, like turning an address into coordinates or measuring the distance between two places? The “GeoPy” library has you covered. Want to make your data charts look nice and pretty? “Seaborn” has got your back. Need something that will easily help you scrape a web page? Never fear, “Beautiful Soup” is here! (Yes, that is its actual name.)
As you can probably guess, there are a lot of coding libraries tailored just for data science and machine learning purposes. I’m talking about libraries like scikit-learn, Keras / TensorFlow, PyTorch, and more. These libraries are wonderful tools, and they are what make the statement I opened this post with true. I don’t have to spend the time writing the complex mathematical algorithm because that has been abstracted away from me behind some very simple code. It’s just a matter of properly importing those libraries and popping in my data. Easy peasy.
…a little too easy peasy.
(Side note: I’m typing this post on my iPhone, and gosh darned autocorrect would not stop trying to change the word “peasy” to “peashooter.” I just really want to know why their predictive algorithm thinks “easy peashooter” is more likely to occur than “easy peasy.” Who are these mysterious people typing easy peashooter? I digress…)
I get why these data science libraries were created. The underlying mathematical algorithms are way too complex for most people to code by hand, and I definitely include myself in that category. When you start getting into stuff like deep learning, I’m pretty sure only a PhD could “hand code” one of those monstrosities, and we’re pretty short of PhDs in the business world. Abstraction via coding libraries is an absolute must because of this.
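To make the five-lines claim from the top of this post concrete, here’s a minimal sketch of what I mean. I’m assuming scikit-learn is installed, and I picked its built-in iris flower dataset and a logistic regression model purely for illustration; neither choice is special.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                    # small sample dataset that ships with scikit-learn
model = LogisticRegression(max_iter=1000).fit(X, y)  # all the heavy math is hidden inside fit()
predictions = model.predict(X[:5])                   # predicted flower species for the first five rows
```

Five lines, and at no point did I have to know what the algorithm is doing under the hood. Notice what is missing, though: nothing here checks whether the model is any good.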
But here’s where things can go wrong. Let’s revisit our example of driving a car. Like we noted, car manufacturers have abstracted away the inner workings of a car so that a regular person can still operate one with no problem. The big problem is that this abstraction makes driving so easy that it also lets in people you do NOT want behind the wheel of a car. I’m thinking specifically about drunk drivers. Unfortunately, it’s just as easy for a drunk driver to get behind the wheel as it is for a sober driver. Through no fault of the car itself, the drunk driver can do some pretty heinous things with unfortunate ease.
In the title of the post, I use the term “citizen data science.” It’s not a term I made up. Citizen data science refers to the real phenomenon in which abstraction via data science coding libraries has made it very easy for the average “citizen” to perform data science activities. Just like a driver doesn’t have to understand how a car works to drive a car, people using data science libraries honestly don’t need to understand how the underlying algorithms work. It’s a bit of a harsh analogy, but in this regard, citizen data scientists are similar to drunk drivers. This isn’t to judge the intentions of these people (which I genuinely believe are good most of the time) but rather to point out that, you know… we gotta be careful.
So how do you get around this? I’ll suggest two very high level policies or procedures. First, ensure you have people working on the model who thoroughly understand the data you’re interacting with. I’ve talked about this ad nauseam in other posts. This whole business of machine learning is all about finding meaningful patterns in the right pieces of information, and that’s an idea that totally transcends technology. You have to understand how the inputs will affect the output on a pure business level. I use this phrase often for good measure: garbage in, garbage out.
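Here’s a small sketch of “garbage in, garbage out” in action. Everything in it is made up for illustration: the inputs are pure random noise, the labels are coin flips, and the decision tree model and random seeds are arbitrary choices on my part. I’m assuming NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))    # "garbage in": ten columns of pure noise
y = rng.integers(0, 2, size=1000)  # coin-flip labels with no relationship to X

# Hold back a quarter of the rows so the model can be checked on data it never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The library trains without complaint, even though there is nothing to learn.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)  # accuracy on rows the model memorized
test_accuracy = model.score(X_test, y_test)     # accuracy on rows the model never saw
print(f"train accuracy: {train_accuracy:.2f}, test accuracy: {test_accuracy:.2f}")
```

The library never warns you that the data is meaningless. The tree memorizes the training rows and looks perfect on them, while its accuracy on the held-back rows should hover around a coin flip. Someone who only looks at the first number walks away thinking they built a great model.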
The second suggestion is to establish proper model validation procedures. Just like “third party” internal auditors comb a company’s finances to ensure accounting methods are in proper order, it is a good practice to have somebody else look over a predictive model to ensure it likewise is in working order. Of course, do what makes the most sense for your company’s needs. You might have a “tiered” approach where a financially significant model is much more highly scrutinized than one that saves you money on, I don’t know… pencils. (Do people still use pencils?)
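What a validation check looks like will depend on your company, but as one hedged sketch, a reviewer might compare a candidate model against a “know-nothing” baseline using cross-validation. The dataset (scikit-learn’s built-in breast cancer sample), the logistic regression candidate, and the five-fold setup are all my own illustrative assumptions, not a prescribed procedure.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The model under review: scale the inputs, then fit a logistic regression.
candidate = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# The baseline ignores the inputs and always guesses the most common label.
baseline = DummyClassifier(strategy="most_frequent")

# Each model is trained and scored five times, always on rows it did not train on.
candidate_scores = cross_val_score(candidate, X, y, cv=5)
baseline_scores = cross_val_score(baseline, X, y, cv=5)

print(f"candidate accuracy: {candidate_scores.mean():.2f}")
print(f"baseline accuracy:  {baseline_scores.mean():.2f}")
```

If the candidate can’t clearly beat a baseline that never looks at the data, that’s a red flag worth raising before the model goes anywhere near a business decision.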
And with that, I think we’re good to wrap up another post. The way I learned about abstraction was, ironically, a lot more abstract than I would have liked it to be, so I honestly had a very difficult time understanding what it meant for the longest time. I really hope the analogy of the car made it much simpler for you. It’s a neat concept once you get the hang of it, and you’ll no doubt see it in every aspect of your life from now on. Thanks for checking out this post! Catch you in the next one.