Five Less Common Python Packages I Like for Data Science!

David Hundley
Published in DataDrivenInvestor
6 min read · Nov 12, 2019

When it comes to data science, Python is definitely my go-to programming language for all things predictive modeling. My first programming class in high school taught Visual Basic, and I later learned Java as my next programming language. Neither of those languages was necessarily awful, but their syntax compared to Python can be a little confusing at times. (Curse you, semicolon!)

But perhaps my favorite reason for using Python is the robust support it gets from ancillary packages created by all sorts of people. If you’re active in the data science community, then I’m certain you’re familiar with staples like NumPy, pandas, and Matplotlib. Those packages are awesome; they’re popular for a reason! If I’m not typing “import pandas as pd” the second I open a new Jupyter notebook, then what am I living for??

In this post, I want to give a little love to some Python packages that don’t get nearly as much attention as the big guys listed above. Maybe you’ve heard of some of these, and maybe you haven’t. They might not have as broad a reach in functionality as those other packages, but they are masters of their niche. My hope is that you walk away learning something you can apply to your own practices.

Let’s get into it! 🐍

Dask

If you’re familiar with open source projects like Apache Airflow, then you’ll be familiar with the core concept behind this package: parallelism through lazy evaluation. If you’re not familiar with that concept, the basic gist is that lazy evaluation defers executing your functions as long as it can, building up a task graph so that independent pieces can be processed in parallel appropriately. Picture two sets of functions: Functions 1a and 1b depend on one another, whereas Functions 2 and 3 are independent. Using baseline Python, the functions will process one after the other in sequential order. Using Dask, you can mark the functions appropriately and have the independent ones executed in parallel, thus cutting down on processing time pretty significantly.

This is pretty awesome if you have big transformations to execute on large / complex datasets. The one thing you’ll have to watch here is memory usage. Dask can be great, but it can quickly become a memory hog if you’re not careful. If you optimize your memory usage alongside Dask, you’ll find this package to be your best friend in data processing! 🏃
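To make the lazy evaluation idea concrete, here’s a minimal sketch using Dask’s `dask.delayed` decorator. The function names and values are toy placeholders of my own, not from any real pipeline:

```python
import dask


# Marking a function with @dask.delayed makes calling it lazy:
# the call builds a task graph instead of executing immediately.
@dask.delayed
def load(x):
    return x * 10


@dask.delayed
def combine(a, b):
    return a + b


# Nothing has actually run yet -- these are lazy placeholders.
part_a = load(1)  # independent of part_b
part_b = load(2)
total = combine(part_a, part_b)

# compute() walks the task graph, and Dask's scheduler can run
# the independent pieces (part_a and part_b) in parallel.
result = total.compute()
print(result)  # 30
```

The win scales with how much genuinely independent work your graph contains; two tiny functions like these won’t show a speedup, but big transformations across many partitions will.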

Learn more about Dask here.

Imbalanced Learn

One of the biggest challenges with many classification datasets is that you’ll almost never have an even balance between target classes. This is certainly true in cases of bank fraud, cancer detection, and more. The question is, how do you handle it? There’s no one-size-fits-all answer to that specific question, but it will likely involve either gathering more data or appropriately resampling the data you have.

Speaking of resampling, one popular technique is to synthetically create samples to level the playing field on behalf of your underrepresented class. This is what is called Synthetic Minority Oversampling Technique, or SMOTE for short. The Imbalanced Learn package helps simplify implementation of SMOTE with many different easy-to-use Python functions. Imbalanced Learn offers a broad-sweeping suite of stuff beyond just SMOTE, but SMOTE is how I’ve used this package most often!


Learn more about Imbalanced Learn here.

Featuretools

One of the most time-consuming parts of any data scientist’s role is putting data features through their paces to glean more information from them. This could be as simple as analyzing mean values of a single feature, or as involved as combining features across multiple datasets to surface entirely new insights. Don’t you wish there were a simpler, much quicker way to get at a lot of those insights?

That’s where Featuretools comes into play! Featuretools enables you to quickly link some datasets together and perform a “Deep Feature Synthesis” with a target entity in mind, churning out a TON of new features for you. Granted, you’re still going to have to do some manual processing to get really fine-grained insights, but if you’re looking for something quick and dirty, Featuretools is the package for you. 🔬

Learn more about Featuretools here.

Pylint

Okay friends… I have to admit, I’m not exactly the cleanest coder when it comes to initially dashing out the gate into a new Jupyter notebook. I was just looking at some former work this morning and noticed that I imported the same Scikit-Learn package in three different places at different points in the same notebook. 🤦🏽‍♂️

When you get into the heat of things, it’s easy to lose track of what you did. Whether it be importing the same thing multiple times like I did or mislabeling one of your variables, these little mistakes can quickly add up if you try to push them out into production. Combing through your code manually can help, but hey — we’re human. We’re all prone to missing some stuff, so this is where Pylint can come to our rescue. Pylint automates all that cleanup by statically analyzing your code and making very specific suggestions as to where you might need to make some adjustments. It’s much quicker than going through your code manually, and it’s more apt to catch things than you as a human are! 🧹
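Using it is a one-liner from the command line. Here’s a quick sketch — the `demo.py` file is a hypothetical example containing exactly the kind of duplicate import I confessed to above:

```shell
# Write a small file with a repeated import (hypothetical demo file).
cat > demo.py <<'EOF'
import os
import os

print(os.getcwd())
EOF

# Pylint exits non-zero when it finds issues, so don't let that stop
# the script; among its messages it flags the duplicate import, e.g.
# "W0404: Reimported 'os'".
pylint demo.py || true
```

In a Jupyter notebook you can get similar mileage from extensions like nbQA or by exporting the notebook to a script first, since Pylint operates on `.py` files.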

Learn more about Pylint here.

MLflow

MLflow UI

Of all the packages on this list, this one is personally the newest to me. It hasn’t been around that long (maybe like a year at the point I post this), but it’s already getting a lot of love from the machine learning community, and for good reason! When it comes to optimizing a machine learning model, there really isn’t a great way using something like Scikit-Learn to tie the parameters you tried to the metrics they produced. MLflow helps remedy this by adding a simple syntax that not only logs your parameters, metric outcomes, and the model itself but also enables a nice UI, as seen above, to help you compare all these results. Neat!

But wait, there’s more! MLflow also has some stuff built into it for you to quickly serve up your model for use in both batch and synchronous inference. As I noted before, this package is very new to me, so I haven’t been able to get this working myself quite yet, but there are plenty of folks on the internet who have. I’m excited at the promise this package shows, so you can be assured I plan to keep trying to get this functionality enabled. 👍🏽

Learn more about MLflow here.

Okay folks, that about wraps up this post! I hope you walk away with something new that you can use in your data science arsenal. What are some of your favorite lesser-known Python packages? Be sure to sound off down in the comments. Until next time, happy modeling!


Principal machine learning engineer at a Fortune 50 company, 5x AWS certified, 2x HashiCorp certified, 1x GCP certified, M.A. in Org Leadership, PMP, ChFC, CSM