
Five Things to Consider When Working with Data

Working with data is easily one of the hottest topics these days, and I genuinely don’t think it’s a passing fad, as so many things in life are. The medium through which we work with data may change over time, but data in its conceptual form has been around since the dawn of time. It’s only very recently in history that we’ve been able to capture it via electronic means.

We capture that data because we live in a highly patterned universe and can thus derive inferential insights from past observations. Self-driving cars are an excellent example of this. Anybody who has had a driver’s license for a decent amount of time can tell you that getting behind the wheel and hopping on an interstate to reach a city 100 miles away can often be pretty monotonous. Driving has pretty much become second nature to us because of how repetitive the actions of accelerating, braking, and steering are.

Some smart folks out there have been able to understand, through data, how those patterns work and build algorithms around them to “teach” a car how to drive itself. That teaching has come to be known as one facet of artificial intelligence (AI) called machine learning (or, more specifically, deep learning). At the heart of it all is statistical learning over lots and lots of data.

So if we can use data to teach cars to drive themselves, it’s not at all far-fetched to imagine how else we might be able to tease insights out of data in other new ways, too.

But working with data can get really murky, really quickly. If you’ve seen anything in the past few years about Facebook’s or Google’s use of data, then you know what I’m talking about. And as a burgeoning data scientist myself, showing up with the data science toolbox doesn’t automagically make me qualified to pull good insights from any data on the spot. That said, here are some things I’ve learned along the way about what I’ve needed to consider when working with data.

1. It’s very difficult to glean insights from data without having domain knowledge about it.

Just like you can’t show up to a job site with a toolbox and expect to know what house to build, you can’t show up with analytical tools and expect to instantly know how to get good insights from your data. I learned this firsthand when wrapping up a project last week as part of the Udacity data scientist nanodegree. I was ready to jump in with my newfound knowledge about supervised learning techniques, only to find myself floundering because I had no idea what the data was telling me. I ended up having to take a step back and spend some time actually learning the data before I was able to do anything with it.
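In case it helps to see what “learning the data” looked like in practice, here’s a minimal sketch of the kind of exploratory pass I mean, in Python with pandas. The file name and columns are hypothetical; the point is simply to poke at the data before reaching for any models.

```python
# A minimal exploratory pass over a dataset before any modeling.
# "project_data.csv" and its columns are hypothetical, just for illustration.
import pandas as pd

df = pd.read_csv("project_data.csv")

# Get a feel for the size and types of what you're working with
print(df.shape)
print(df.dtypes)

# Summary statistics and missing-value counts often reveal surprises
print(df.describe(include="all"))
print(df.isna().sum())

# For text/categorical columns, the distribution of values tells you a lot
for col in df.select_dtypes(include="object").columns:
    print(df[col].value_counts().head())
```

None of this is fancy, but it’s the step I skipped the first time around, and it’s the step that made the supervised learning work possible later.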

And this should make sense. If you’re trying to build a cancer-detecting algorithm, you have to have some level of medical domain knowledge. For example, you’d have to know how to properly read an MRI or CT scan yourself before you could teach a computer to generate those same insights in the future. Remember, the computer isn’t as smart as you (today), so you have to first learn something yourself before you’re able to teach it to a computer.

2. Your data can be rendered almost useless if you don’t set proper data management / governance around it.

The idea of creating a giant data lake is super popular these days, and for good reason. Having the biggest possible pool of data to build inferences from is extremely helpful. But it can also go very wrong very quickly.

Imagine you’re building a house and somebody points you to a giant warehouse and says, “Everything you need to build that house is in there.” You walk into the warehouse, and everything is a huge stinkin’ mess. Wood is either leaning up randomly against walls or trapped under cinder blocks. Copper, iron, and plastic piping of all diameters are thrown into the same big pile. And how the heck did that box of nails get up in the rafters??

It will take you weeks — if not months — to sort through all this stuff. By that point, you might throw your hands up in futility and walk away from the project entirely. Wouldn’t it be great if it was all neatly organized? You know, the way Home Depot or Lowe’s organizes their warehouses?

The same very much holds true for data. Data lakes can quickly become data landfills if not properly managed. This means things like setting proper data quality attributes, proper metadata tag management, proper information lifecycle management, and more. And just like most things in life, the earlier you can do these things right, the easier it is to manage in the long term.
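To make “proper data management” a little more concrete, here’s a minimal sketch of the kind of dataset-level metadata you might track for everything that lands in the lake. The field names and values are my own illustrative choices, not a formal governance standard.

```python
# A minimal sketch of dataset-level metadata for a data lake entry.
# Field names and values are illustrative, not a formal governance standard.
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List

@dataclass
class DatasetMetadata:
    name: str
    owner: str                    # who is accountable for this data
    source_system: str            # where the data originally came from
    last_refreshed: date          # helps catch stale data
    retention_days: int           # information lifecycle management
    quality_checks: Dict[str, float] = field(default_factory=dict)  # e.g. null rates
    tags: List[str] = field(default_factory=list)                   # searchable metadata tags

sales_meta = DatasetMetadata(
    name="retail_sales_2020",
    owner="analytics-team@example.com",
    source_system="point_of_sale",
    last_refreshed=date(2020, 6, 1),
    retention_days=730,
    quality_checks={"null_rate_customer_id": 0.02},
    tags=["sales", "retail", "contains_pii:no"],
)
```

Even a lightweight record like this, kept consistently from day one, goes a long way toward keeping the data lake from becoming a data landfill.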

3. Sometimes, proper data management means doing what seems intuitively wrong.

Okay, I’m sure it sounds like I’m totally going backward from what I shared in the last point, but I promise you I’m going somewhere with this. You see, we as humans learn things differently than the way computers learn things today. For as much as we’ve learned about how we’re educated, humanity still doesn’t know exactly what makes the brain tick.

On the other hand, we do know exactly how computers work, and the reason is obvious: we built them! Computers today work off lots and lots of mathematical calculations, so the way we teach them to do things (via coding) is totally different from how we would tell a person to do the same thing. That said, sometimes proper data management means doing what seems intuitively wrong for the human… but totally right for the computer.

For example, data denormalization is a common data management practice. Whereas normalization is all about dividing data into neat, intuitive structures for humans to understand, denormalization intentionally organizes the data for optimal use by computers. Normalization might incline you to place certain pieces of information into separate tables, but every query that needs them together then forces the computer to join those tables back up, which costs compute power. And in an age of cloud computing, where providers charge based on compute, denormalization helps to offset those costs by lumping certain datasets together and not asking the computer to do more than we need it to. (“Thanks, human!” says the computer.)
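Here’s a toy illustration of that trade-off using pandas. The tables and columns are made up; the point is that the denormalized version duplicates the customer name onto every order row so that frequent reads don’t have to pay for the join each time.

```python
# Toy normalization vs. denormalization example with made-up tables.
import pandas as pd

# Normalized: customers and orders live in separate tables. Tidy for humans,
# but every read that needs both forces a join.
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "customer_name": ["Ada", "Grace"],
})
orders = pd.DataFrame({
    "order_id": [100, 101, 102],
    "customer_id": [1, 2, 1],
    "amount": [25.0, 40.0, 15.0],
})

# Denormalized: the customer name is copied onto each order row up front,
# trading some duplicated storage for cheaper, join-free reads later.
orders_denormalized = orders.merge(customers, on="customer_id")
print(orders_denormalized)
```

It feels “wrong” to a human to store the same name three times, but the computer is more than happy to trade a little duplication for a lot less work.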

4. You have to consider the ethical ramifications of your data.

Ah yes, you knew this was bound to show up. While things like information security and data privacy are definitely important, I think that horse has been beaten enough elsewhere. Rather, what I want to focus on here is how you actually leverage your data to gather insights. You may have heard it said that you can interrogate data until it tells you anything you want to hear, and that’s totally true.

For example, let’s say you were building a computer algorithm to scan through resumes to find the best possible candidate for a CEO-level position. The question is, how will you leverage your data to derive ethical insights? You could easily build your algorithm around historical CEO demographics throughout the 20th century, but do you know what would happen if you did that? Your algorithm is almost always going to tell you to hire a middle-aged or older white male! It’s not the computer’s fault; it’s just building its intuition off what you fed it. But we’ve since come a long way as a society to see that all kinds of people would make amazing CEOs, not just white guys.
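If you want to see just how directly a model mirrors its training data, here’s a toy sketch with entirely made-up numbers, using scikit-learn. One fabricated binary feature stands in for “belongs to the demographic that dominated 20th-century CEO hires”; the model dutifully learns to favor it.

```python
# Toy illustration (fabricated data) of a model reproducing historical bias.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# 100 made-up historical hiring decisions: the dominant demographic was
# hired far more often, so that's the pattern the model will learn.
history = pd.DataFrame({
    "dominant_demographic": [1] * 95 + [0] * 5,
    "hired":                [1] * 90 + [0] * 5 + [1] * 2 + [0] * 3,
})

model = LogisticRegression()
model.fit(history[["dominant_demographic"]], history["hired"])

# Two otherwise identical candidates who differ only on that one feature:
# the model scores the dominant-demographic candidate far higher.
candidates = pd.DataFrame({"dominant_demographic": [1, 0]})
print(model.predict_proba(candidates)[:, 1])
```

The model isn’t malicious; it’s just faithfully echoing the history it was handed. That’s exactly why the choice of what data to feed it is an ethical choice.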

Just because you have data at your disposal doesn’t mean it’s always wise to use it. Again, you don’t just show up with a toolbox and expect the tools themselves to build the perfect house for you. At the end of the day, HOW you leverage your data will ultimately dictate the kind of insights you’ll end up gleaning.

5. Your data probably doesn’t tell the full story.

For as far as we’ve come, we’ve barely begun to scratch the surface on electronically capturing information in the form of data. With an ever-increasing Internet of Things (IoT) market, we’re now capturing some crazy data we never imagined we’d ever find relevant just a few decades ago. Like… we even have refrigerators gathering data these days!

The truth is that we’re so limited in our electronic understanding of the world that the probability you’re missing a key piece of information that would radically redefine your insights is extremely high. This is part of why companies like Facebook have gotten into hot water: other companies realized Facebook is a gold mine of data and paid to leverage that data for their own use. Frankly, both sides of that table are behaving unethically. Facebook just gets the stronger spotlight as the one holding the “keys to the kingdom” of data.

So the real question is, what ethical choices will you make in light of the fact that you’re probably missing pivotal data? Will you choose to press on knowing your knowledge is limited, or will you coerce the Googles and Facebooks of the world into giving you their gold mine of data? To be clear, I don’t think it’s always wrong to continue with limited knowledge. Clearly, we’ve been able to do a lot with a little, given that self-driving cars exist. I do, however, think it’s in your best interest to be transparent about whatever you do. (And I’ve written a whole different post about this elsewhere.)

That’s it for this week’s post, friends! It got longer than I thought it would, but I think you’ll find a lot of value in this one. I know it’s something I would have wanted to read myself about a year ago, and we all know this blog helps me more than it helps anybody, right? We’ll see you next week.
