
Starbucks Reward Program Project

Hey there folks! If you’re wondering why this post doesn’t look like my typical set of posts, that’s because this particular post is being done as the capstone project for my Udacity machine learning nanodegree. If you want to skip past this post, that’s A-OK, but if you’re interested at all in machine learning, I might encourage you to check it out. If you want to see all my work for this project, check out my repo on GitHub.

It’s no secret that I’m a big fan of Starbucks. I visit Starbucks perhaps two to three times per week. As a daddy to two little girls under two years old, I often leverage one of our local Starbucks locations as a getaway to study or focus on projects… like this one!

Let’s introduce what this project is looking to do. As a means to attract and retain customers, Starbucks leverages a rewards program that honors regular customers with special offers not available to the standard customer. For this project, we’ll be combing through some fabricated customer and offer data provided by Starbucks / Udacity to understand how Starbucks may choose to alter its rewards program to better suit specific customer segments.

A desire to glean insights about customer behavior is one of the hottest uses of machine learning today, and rightfully so. It is often difficult to understand how certain behaviors influence others, so leveraging things like unsupervised learning models goes a long way toward better understanding customer behavior.

Throughout my professional work experience, I have worked directly with customers in things like focus groups, though not with predictive modeling. (At least, not yet.) Fortunately, I am also a student of Udacity’s Data Science nanodegree program and recently completed the Arvato project on determining customer segments. Coincidentally, that was another offering I could have selected for this particular capstone, but I wanted to focus my sights on something else entirely. Still, the work from that project will serve me well as I seek to pivot the unsupervised learning knowledge applied there to this particular project. (Link to Hundley — Arvato Customer Segments GitHub)

The problem we are looking to solve here is conceptually easy to understand, albeit difficult to answer. We are looking to best determine which kind of offer to send to each customer segment based on their purchasing decisions. We’ll touch more on what these offers are and the data we’ll be utilizing in the next section. We will leverage traditional evaluation metrics to determine which model is most appropriate for our dataset. These evaluation metrics will be discussed in an upcoming section.

For this project, we will be leveraging the data graciously provided to us by Starbucks / Udacity. This is given to us in the form of three JSON files. Before delving into those individual files, let us first understand the three types of offers that Starbucks is looking to potentially send its customers:

  • Buy-One-Get-One (BOGO): In this particular offer, a customer is given a reward that enables them to receive an extra, equal product at no cost. The customer must spend a certain threshold in order to make this reward available.
  • Discount: With this offer, a customer is given a reward that knocks a certain percentage off the original cost of the product they are choosing to purchase, subject to limitations.
  • Informational: With this final offer, there isn’t necessarily a reward but rather information about a product the customer might purchase. (This might be something like letting customers know that the Pumpkin Spice Latte is becoming available again toward the beginning of autumn.)

With that understanding established, let’s now look at the three provided JSON files and their respective elements:

1. profile.json

This file contains dummy information about Rewards program users. This will serve as the basis for basic customer information.

(17000 users x 5 fields)

  • gender: (categorical) M, F, O, or null
  • age: (numeric) missing value encoded as 118
  • id: (string / hash)
  • became_member_on: (date) format YYYYMMDD
  • income: (numeric)

2. portfolio.json

This file contains offers sent during a 30-day test period. This will serve as the basis to understand our customers’ purchasing patterns.

(10 offers x 6 fields)

  • reward: (numeric) money awarded for the amount spent
  • channels: (list) web, email, mobile, social
  • difficulty: (numeric) money required to be spent to receive reward
  • duration: (numeric) time for offer to be open, in days
  • offer_type: (string) bogo, discount, informational
  • id: (string / hash)

3. transcript.json

This file contains event log information. Complementing the file above, this file will serve as a more granular look into customer behavior.

(306648 events x 4 fields)

  • person: (string / hash)
  • event: (string) offer received, offer viewed, transaction, offer completed
  • value: (dictionary) different values depending on the event type
      ◦ offer id: (string / hash) not associated with any “transaction”
      ◦ amount: (numeric) money spent in a “transaction”
      ◦ reward: (numeric) money gained from an “offer completed”
  • time: (numeric) hours after start of test

Given that we do not have any labels or ground truth that would enable us to leverage supervised learning models, we will be leveraging unsupervised learning methods amongst our data to determine potential strategies for adjusting the Starbucks Rewards program given our customer insights. Specifically, we’ll be leveraging hierarchical modeling to cluster our data into a few respective customer segments for analysis.

Given that we will be leveraging unsupervised clustering models for our project, we will be using some metrics that enable us to validate our clusters without having labelled data. Namely, we will be leveraging the silhouette coefficient. Because we don’t have labelled data, the silhouette coefficient is appropriate since it produces a score in the range of -1 to 1 based on internal indices. It also happens to be easy to calculate with help from scikit-learn.

Additionally, we will be leveraging the elbow method through a simple function that will iterate over a number of K-Means cluster counts and display the Sum of Squared Errors (SSE) in visual form. This, in conjunction with the silhouette coefficient, will help us settle on an ideal number of clusters for our final algorithm.
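
To make these concrete, here is a minimal sketch of how both metrics might be computed with scikit-learn. This assumes `X` is our feature matrix; the variable names are illustrative, not the project’s exact code.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

sse, sil = [], []
k_values = range(2, 11)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(X)               # fit and assign each point to a cluster
    sse.append(km.inertia_)                  # sum of squared errors for the elbow plot
    sil.append(silhouette_score(X, labels))  # internal index in [-1, 1]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(list(k_values), sse, marker='o')
ax1.set(xlabel='Number of clusters (k)', ylabel='SSE')
ax2.plot(list(k_values), sil, marker='o')
ax2.set(xlabel='Number of clusters (k)', ylabel='Silhouette coefficient')
plt.show()
```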

Before we get into exploring and manipulating our data, it’s important to explain some of the algorithms we’ll be using throughout the project and how they work. Keep in mind the solution statement defined above: we’re looking to let the data itself tell us how it wants to be clustered. Because we don’t have a “ground truth” of what is right or wrong, we will have to leverage unsupervised learning algorithms. We will talk about three of them in this section, although we only plan on leveraging two of them in this project.

(Big thanks to this blog for helping me create the explanatory visuals below across K-Means and DBSCAN: https://www.naftaliharris.com/blog/visualizing-k-means-clustering/)

K-Means Clustering

The first and perhaps most basic of our algorithms is K-Means. K-Means works by initializing a number of centroids that serve as a sort of magnet, attracting nearby data points around them. The number of centroids is determined by us as the user. For example, in scikit-learn’s “KMeans” algorithm, we set how many centroids there will be by passing in an “n_clusters” parameter.
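
As a quick toy illustration (made-up 2-D data, not our Starbucks dataset), here’s what that looks like in scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans

# Six toy points forming two obvious groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(kmeans.labels_)           # which cluster each point landed in
print(kmeans.cluster_centers_)  # final centroid positions
```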

Each time the algorithm steps through an iteration, each respective centroid moves closer and closer toward the “center” of the clustered data. I’ve visualized this with some very basic data in the three screenshots below. Notice that the initialized centroids do a pretty poor job at clustering the data in the first pass, then do a slightly better job in the second pass, and then finally do a great job in the third pass.

[Figure: K-Means First Pass]

[Figure: K-Means Second Pass]

[Figure: K-Means Third Pass]

Of course, our example here is a radical oversimplification, and it happens to work out nicely that the clusters are so neat and tidy. This definitely won’t be the case with most datasets. In fact, look at what happens when we try to apply K-Means to something like this smiley face set of data, even after several iterations of the algorithm.

[Figure: K-Means Smiley After Several Iterations]

DBSCAN

Let’s now talk about Density-Based Spatial Clustering of Applications with Noise, or DBSCAN for short. DBSCAN combs (or scans) through our data by randomly selecting a starting point in our dataset and then branching out to other nearby data points. It’ll continue along its current path of clustering so long as the epsilon and minimum points parameters are satisfied. The epsilon value is best thought of as the “radius” defining how far the next data point can be from the previous one, and the minimum points parameter describes the minimum number of points that must be captured along the way. If these parameters are not satisfied, DBSCAN simply jumps to another random place in the dataset and begins again. This is best described visually, again using our smiley face friend. (Try not to have nightmares with the last one.)
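
In scikit-learn, those two parameters map onto DBSCAN’s `eps` and `min_samples` arguments. A minimal sketch on toy data (again, illustrative rather than our project code):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense toy groups plus one far-away point
X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])

db = DBSCAN(eps=3, min_samples=2).fit(X)
print(db.labels_)  # points that satisfy neither parameter are labeled -1 (noise)
```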

[Figure: DBSCAN after ~5 seconds]

[Figure: DBSCAN after ~20 seconds]

[Figure: DBSCAN in the end]

As you can see, DBSCAN did a much better job with our smiley face friend than K-Means. So you might be asking yourself: if DBSCAN worked so well, then why not use it for our dataset here? Remember what DBSCAN stands for: Density-Based Spatial Clustering of Applications with Noise. You’ll notice I’ve placed emphasis on that last part. DBSCAN works really well on datasets that you know will have noise, but given how concise we’re going to get our data in this project, I’m not sure there will be any noise at all.

Perhaps one final DBSCAN visual will help to illustrate this best.

[Figure: DBSCAN on “pimpled” smiley face]

Here, we’ve again applied DBSCAN on a “pimpled” smiley face. Notice that these “pimples” were left out of all clusters entirely. Now, if we were to perform a DBSCAN on our dataset, we might be losing out on valuable insights that were incorrectly classified as noise. That said, let’s round out this section with a brief discussion on hierarchical algorithms.

Hierarchical Algorithms

(Note: The same blog above didn’t have a tool for hierarchical clustering. Instead, we’ll use a handful of visuals directly from the Data Scientist nanodegree for explanations. Special thanks to Udacity for this!)

Hierarchical clustering seems similar to DBSCAN in practice but is radically different under the hood. Where DBSCAN clustered elements by sweeping through with epsilon values, hierarchical clustering automatically begins with the underlying assumption that every single data point is already its own cluster. (That said, no data point gets left behind in a hierarchical method.) From there, the hierarchical clustering then leverages different forms of linkage to determine how to best cluster the data. Here’s a very simple visual of how this might look in general.

[Figure: Hierarchical Clustering (Courtesy of Udacity)]

You’ll notice in the visual a sort of “tiering” amongst the clustering. At the very bottom, you see that, as noted above, every data point is clustered as its own cluster. From there our hierarchical clustering then moves up the hierarchy to determine the next best clustering of our data. We keep moving up and up the hierarchy until eventually everything is clustered together in one massive cluster, as indicated by the yellow circle.

Of course, we don’t want either end of the spectrum. We don’t want a million little clusters, nor do we want one giant, massive cluster. Instead, we want just the right amount of clusters, and that is defined by us within the hyperparameters of our model. We’ll see how this shakes out for our own project further on down.
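
To make that tiering concrete, here is a small sketch with SciPy (assuming `X` is a feature matrix; the names are illustrative): `linkage` builds the full hierarchy, `dendrogram` draws the tiers, and `fcluster` cuts the tree at our chosen number of clusters.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Build the full merge hierarchy bottom-up: every point starts as its own cluster
Z = linkage(X, method='ward')

# Draw the hierarchy; truncation keeps the plot readable for larger datasets
dendrogram(Z, truncate_mode='level', p=5)
plt.show()

# Cut the tree into exactly 4 clusters
labels = fcluster(Z, t=4, criterion='maxclust')
```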

The data in its initial form is decent, but we will need to clean it up some in order to best leverage it for our unsupervised model later on in the project. Specifically, I cleaned up the initial datasets and then later combined them to form a master dataset that we’ll be regularly working from in the remainder of the project. We’ll discuss that initial cleansing here and talk more about the latter preprocessing in another section.

Portfolio Clean Up

  • Changing the column name from ‘id’ to the more descriptive ‘offer_id’ since the id column is present in our other datasets
  • One hot encoding the ‘offer_type’ column to work well with our algorithms later
  • Separating and one hot encoding the ‘channels’ column to also work with our algorithms later
  • Dropping the ‘offer_type’ and ‘channels’ columns now that they are one hot encoded in other columns
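
Here is a hedged sketch of what those four steps might look like in pandas, assuming `portfolio` is the DataFrame loaded from portfolio.json (the actual code lives in the GitHub repo):

```python
import pandas as pd

# Rename 'id' to the more descriptive 'offer_id'
portfolio = portfolio.rename(columns={'id': 'offer_id'})

# One hot encode the 'offer_type' column (bogo / discount / informational)
offer_type_dummies = pd.get_dummies(portfolio['offer_type'])

# Separate and one hot encode the list-valued 'channels' column
for channel in ['web', 'email', 'mobile', 'social']:
    portfolio[channel] = portfolio['channels'].apply(lambda chs: int(channel in chs))

# Drop the originals now that they are encoded in other columns
portfolio = pd.concat([portfolio, offer_type_dummies], axis=1)
portfolio = portfolio.drop(columns=['offer_type', 'channels'])
```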

Profile Clean Up

  • Dropping rows with null information
  • Changing ‘id’ column to ‘customer_id’ name
  • Changing the ‘became_member_on’ column to a date object type
  • Calculating number of days that a person has been a member as a new ‘days_as_member’ column (as of August 1, 2018)
  • Creating new ‘age_range’ column based off ‘age’ column
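
A similar hedged sketch for the profile steps, assuming `profile` is loaded from profile.json (the age-range bin edges below are illustrative choices, not necessarily the project’s):

```python
import pandas as pd

profile = profile.dropna()                                # drop rows with null info
profile = profile.rename(columns={'id': 'customer_id'})

# Convert the YYYYMMDD integer into a proper date
profile['became_member_on'] = pd.to_datetime(
    profile['became_member_on'].astype(str), format='%Y%m%d')

# Days as member, measured against August 1, 2018
profile['days_as_member'] = (pd.Timestamp('2018-08-01')
                             - profile['became_member_on']).dt.days

# Bucket ages into ranges (bin edges are an assumption)
bins = [0, 19, 29, 39, 49, 59, 69, 79, 130]
labels = ['<20', '20s', '30s', '40s', '50s', '60s', '70s', '80+']
profile['age_range'] = pd.cut(profile['age'], bins=bins, labels=labels)
```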

Transcript Clean Up

  • Changing the name of the ‘person’ column to ‘customer_id’
  • Removing the customers that are not reflected in the ‘profile’ dataset
  • One hot encoding the ‘event’ values
  • Changing the ‘time’ column to ‘days’ along with appropriate values
  • Separating value from key in ‘value’ dictionary in order to form two wholly separate datasets: transcript_offer and transcript_amount
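
And a sketch of the transcript steps, assuming `transcript` is loaded from transcript.json and `profile` has been cleaned as above (the `value` dictionary uses slightly different keys per event type, hence the defensive `.get` calls):

```python
import pandas as pd

transcript = transcript.rename(columns={'person': 'customer_id'})

# Keep only customers present in the cleaned profile data
transcript = transcript[transcript['customer_id'].isin(profile['customer_id'])]

# One hot encode the event values and convert hours to days
transcript = pd.concat([transcript, pd.get_dummies(transcript['event'])], axis=1)
transcript['days'] = transcript['time'] / 24.0

# Split the 'value' dictionary into two wholly separate datasets
transcript['offer_id'] = transcript['value'].apply(
    lambda v: v.get('offer id') or v.get('offer_id'))
transcript['amount'] = transcript['value'].apply(lambda v: v.get('amount'))

transcript_offer = transcript[transcript['offer_id'].notna()].copy()
transcript_amount = transcript[transcript['amount'].notna()].copy()
```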

With initial cleansing complete, we can go ahead now and perform an exploratory data analysis on our datasets. We’re going to separate this analysis into four distinct questions (Q#’s), initial thoughts, visual analyses, and reflective summaries (A#’s).

Q1: What are the general age ranges of our customers?

Given the general hip, young vibe associated with Starbucks, I’m expecting this to be skewed right, meaning that we’ll see more customers in those younger age ranges like 20’s or 30’s. With that thought in mind, let’s view what the dataset actually tells us.

[Figure: Age Distribution of Starbucks Customers]

A1: The actual age distributions of Starbucks customers.

Well, I was certainly wrong with my initial assessment! This is a classic example of why it’s important to not make assumptions and let the data speak for itself. I suppose now that I think about it, I do tend to see many folks in those 40–60ish age ranges any time I visit. Perhaps it’s because some of them are now retired and enjoy visiting with friends at Starbucks. Perhaps also it is that people in their twenties typically don’t have the money to spend on things like Starbucks. I don’t know; the data isn’t particularly clear on this reasoning. No matter. Let’s move on.

Q2: What are the salary ranges of people across different age groups?

Similar to our last question, I’m curious to see how the salary ranges of these various age groups might affect how often a person visits Starbucks and utilizes their rewards program. Given our first assessment, I’m going to guess that those 40–60 age ranges have the highest salary ranges given that these people are generally further along in their careers and thus make more money to show for it. Likewise, I definitely expect those younger age ranges to be on the lower end. Let’s go ahead and take a look!

[Figure: Starbucks Customers — Income Distribution Across Age Ranges (Violin Plot)]

[Figure: Starbucks Customers — Income Distribution Across Age Ranges (Scatter Plot)]

A2: Analysis of Income Distributions

Okay, when first visualizing the data with the violin plots, my thoughts were affirmed that older customers definitely tend to make more money; however, I noticed something odd in the violin plot: there seemed to be a hard cap on the salary range of younger people. Visualizing the data in a scatter plot using individual ages instead of age ranges, I verified that not only are there caps for younger people, but there are caps on everybody’s salaries. It looks like even for older people, income caps out at $120,000. This seems a really odd choice on the data capturer’s part. Clearly, there are many people that make more than $120,000 per year, and there are definitely people in their 20’s and 30’s that make more than that as well. Unfortunately, we’re not clued in at all as to why these caps were put in place, so we’re just going to have to make do with what we’ve been given and note that in our finalization as well.

Q3: What does the relationship between the number of days an offer has been open and the final transaction amount tell us?

Honestly, I don’t know what to expect from this one, but it still has me curious. Part of me wants to believe that the longer a reward has been available, the less the person is prone to spend. My reasoning for this is that the customer clearly wasn’t that excited to run out and redeem the reward right away, so a longer open time would mean a smaller amount, a sort of “Well, I gotta use it or lose it” kind of thing. My other hypothesis is that we’re not going to see any pattern here at all. Let’s go ahead and let the data tell us which concept is right.

[Figure: Days Reward Open vs. Final Amount Spent (Higher Level)]

[Figure: Days Reward Open vs. Final Amount Spent (Lower Level)]

A3: Analysis of Amount Spent vs. Days Open

A couple of observations here. First, I didn’t realize that rewards only stretched as long as 30 days. This makes my initial curiosity about “use it or lose it” tough to measure because I know as a Starbucks customer myself that I sometimes let my rewards sit out there longer than 30 days.

Second, my initial visualization wasn’t all that helpful because there were surprisingly a lot of outliers, with people spending upwards of $1,000 in a single purchase, which is… sort of beyond me to think about. (Are these people buying the full menu…?) Anyway, this wasn’t all that helpful, so I pared down the data in the next visualization. Before moving on to that, it is worth noting that even with these higher dollar amounts, there is no correlation at all between days open and amount spent.

Coming down to our pared down visualization, we again see no definitive ties between how long a reward is open and the dollar amount finally spent. As noted above, I sort of expected this, but I still am glad I got verification on that!

Q4: Do gender distributions have any major effect on our data here?

For our final EDA piece here, let’s take a look at how gender may affect our final analysis. As a reminder, our dataset has indicated three distinct genders: Male, Female, and Other.

[Figure: Distribution of Genders Across Age Ranges]

[Figure: Distribution of Income Across Age Ranges]

A4: Analysis of gender across our customer data

Several points of interest here. First, we definitely see more males across this dataset than any other gender category. In fact, the only age range where we see more women is the 80+ category, and I'm going to guess that has to do with the fact that women generally tend to live longer than men.

The other really interesting thing here is the salary distribution for women in particular. Where our dataset indicates that there are more male customers than female (or other) customers, females in this dataset generally tend to have a higher income than men. And across both of the primary genders, it's not as if there's a crazy disparity in the average salary, either. This makes me wonder whether this data has been oddly captured, since the general sentiment is that men make more than women. Why is it that this dataset runs counter to that? Unfortunately, this is another one of those instances where we truly don't know based solely on our data.

With initial analysis / clean up and EDA under our belt, let’s move on into formalizing the master dataset I will work from in the remainder of the project. We are going to keep several things in mind to engineer some new features that I feel will be helpful when we actually move toward running our unsupervised algorithms. Here are the features we will leverage as part of this master dataset:

  • customer_id: The unique customer identifier
  • age: The age of the customer
  • age_range: The age range the customer falls into
  • gender: The gender of the customer, either male (M), female (F), or other (O)
  • income: How much money the customer makes each year
  • became_member_on: The date that the customer became a Starbucks Rewards member
  • days_as_member: How many days that the customer has been a Starbucks Rewards member
  • total_completed: The total number of offers actually completed by the customer
  • total_received: The total number of offers that Starbucks sent to the customer
  • total_viewed: The total number of offers that the customer viewed
  • percent_completed: The ratio of offers that the customer completed as compared to how many offers Starbucks sent to the customer
  • total_spent: The total amount of money spent by the customer across all transactions
  • avg_spent: The average amount of money spent by the customer across all transactions
  • num_transactions: The total number of individual monetary transactions performed by the customer
  • completed_bogo: The number of completed BOGO offers by the customer
  • num_bogos: The total number of BOGO offers sent to the customer by Starbucks
  • bogo_percent_completed: The ratio of how many BOGO offers were actually completed by the customer as compared to how many Starbucks sent them
  • completed_discount: The number of completed discount offers by the customer
  • num_discounts: The number of discount offers sent to the customer by Starbucks
  • discount_percent_completed: The ratio of how many discount offers were actually completed by the customer as compared to how many Starbucks sent them
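
To give a flavor of how a few of these features come together, here is a hedged sketch using the cleaned frames from earlier (column names like 'offer received' assume the one hot encoded transcript above; the real aggregation lives in the notebook):

```python
# Count offers received / viewed / completed per customer
totals = (transcript_offer.groupby('customer_id')
          .agg(total_received=('offer received', 'sum'),
               total_viewed=('offer viewed', 'sum'),
               total_completed=('offer completed', 'sum'))
          .reset_index())
totals['percent_completed'] = totals['total_completed'] / totals['total_received']

# Summarize spending behavior per customer
spend = (transcript_amount.groupby('customer_id')
         .agg(total_spent=('amount', 'sum'),
              avg_spent=('amount', 'mean'),
              num_transactions=('amount', 'count'))
         .reset_index())

# Stitch everything onto the customer profiles
customer_transactions = (profile
                         .merge(totals, on='customer_id', how='left')
                         .merge(spend, on='customer_id', how='left'))
```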

We’ve come a long way in this project, and now it’s time to finally get to what we’ve been building toward all along: machine learning modeling. As noted in the proposal for this project, we’re going to leverage some unsupervised algorithms to cluster data in such a way to find commonalities across customer segments based on a number of features. That said, we’re going to take our final dataset (customer_transactions) and drop a few columns just to keep those I think will be relevant to our project. (See accompanying Jupyter notebook for details.)

Feature Selection

Before moving onto scaling our final data, we’ll need to remove some features from our customer_transactions DataFrame. We’ll remove the following features for the following reasons:

  • customer_id: unique to each row, it is a wholly distinct value
  • became_member_on: a date column that can’t be scaled
  • age_range: a categorical column that can’t be scaled
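
In pandas, that removal is a one-liner (assuming the `customer_transactions` frame assembled earlier):

```python
# Keep only the features we can scale and cluster on
features = customer_transactions.drop(
    columns=['customer_id', 'became_member_on', 'age_range'])
```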

Okay, so you’ll notice we went about this a bit haphazardly in our first pass, simply shoving our dataset into our model with little consideration for how it would perform. In this section, we’ll talk about three ways we’ll refine this model in order to deliver more optimized results in the end.

Feature Scaling

In our first pass with our hierarchical model, we went in and ran the model on our data without any scaling. This was an intentional oversight on my part. The reason we need to be concerned about this is that the numbers across different columns are NOT created equal. For example, ‘income’ is measured in the thousands of dollars, whereas something like ‘average amount spent’ typically hovers pretty low, somewhere around $10 to $20. And that’s just one example.

Scaling the data sets all the columns on an equal playing field, ensuring that no single feature outweighs the rest of the dataset. Prior to running our data through the unsupervised model again, we will need to scale the data so that we are given the best results. To do this, we will leverage scikit-learn’s handy tool, StandardScaler.
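
A minimal sketch of that scaling step, using the `features` frame from the selection above:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(features)  # each column: zero mean, unit variance
```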

Linkage

One thing we didn’t discuss in our previous section on hierarchical clustering was linkage. Linkage is the methodology by which data is clustered, literally determining the best link between defined clusters. In our first pass, we utilized single-link clustering. Here’s a quick explanation on that clustering along with the other various kinds of clustering. (This information is taken straight from Udacity’s Data Scientist nanodegree program.)

Single Link: Used in our initial set, single link clustering looks to cluster different sets of data by linking the clusters together at the point at which the two closest points meet. This is often a poor choice of linkage because the clusters could have “longer tails” that don’t at all reflect a proper linkage. This is best illustrated in the image below.

[Figure: Single Link Clustering (Courtesy of Udacity)]

Complete Link: Where single link clustering took the approach of clustering based on the points in a cluster being closest to one another, complete link takes the opposite approach in linking clusters together by the points that are furthest away in a group. Again, this is also prone to our same issue with single link and “long tail” clusters, so this also isn’t an ideal choice for our final model.

[Figure: Complete Link Clustering (Courtesy of Udacity)]

Average Link: Whereas the two clustering methods above took a more narrow approach at defining linkage via a single point, average linking takes into account instead the average of the whole cluster and links based on that instead. We’ll see how this differs from Ward’s linkage down below.

[Figure: Average Link Clustering (Courtesy of Udacity)]

Ward’s Link: Finally, we have the default linkage and the one we will be using for our final method, Ward’s Link. Ward’s is similar to average link in that it looks at all the data points in a cluster before making a decision, except it looks to minimize the variance that results from merging clusters rather than simply using the average distance between them. If that sounds a bit confusing, don’t worry. The screenshot below definitely helps to illustrate the idea nicely.

[Figure: Ward’s Link Clustering (Courtesy of Udacity)]
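
One quick, hedged way to sanity-check these linkage choices empirically is to score each option’s clusters with the silhouette coefficient (assuming `X_scaled` from the scaling step above):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

for method in ['single', 'complete', 'average', 'ward']:
    labels = AgglomerativeClustering(n_clusters=4, linkage=method).fit_predict(X_scaled)
    print(f"{method}: {silhouette_score(X_scaled, labels):.3f}")
```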

Number of Clusters

Finally, determining the number of clusters we’ll leverage is super important to the final analysis of our project. In order to fine tune this parameter, we will perform the elbow method and run a series of silhouette scores. When running these, here is the outcome we get:

[Figure: K-Means SSE Scores]

[Figure: Silhouette Scores]

Given this information, I’ve settled on leveraging 4 clusters for our final algorithm. Transparently, I could have gone as high as 9 given the diminishing returns, but I felt 9 clusters would be too many for the purposes of this project. Four clusters is a reasonable amount to comb through in the final evaluation of this project.

Now that we’ve spent some time refining our model, let’s go ahead and utilize our insights to run our model once again for more optimal results. We’ll review these results in a following section. (See accompanying Jupyter notebook for details.)
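
For reference, that final run might look something like this sketch (assuming `X_scaled` and `customer_transactions` from earlier):

```python
from sklearn.cluster import AgglomerativeClustering

# Ward's linkage with the 4 clusters settled on above
model = AgglomerativeClustering(n_clusters=4, linkage='ward')
cluster_labels = model.fit_predict(X_scaled)

# Attach the labels back onto the customer data for the analysis below
customer_transactions['cluster'] = cluster_labels
```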

Before we move on, I want to discuss the benchmarks and metric evaluation since these came heavily into play in the prior section. Given that there isn’t necessarily a labelled right or wrong to the provided dataset, we can’t really objectively evaluate how well our unsupervised model performed after it has already processed the data. What we can do, however, is leverage our benchmark and metrics to determine the ideal number of clusters for the final algorithm.

We already explored leveraging the elbow method and silhouette score in the previous section, so I instead want to focus on our benchmark comparison here. For this project, I chose to leverage insights from a similar project I did as part of the Udacity Data Scientist nanodegree program. That project was the Bertelsmann-Arvato customer segmentation project. (Link to Hundley — Arvato Customer Segments GitHub) Given that that project was also very focused on customer segmentation and already reviewed by Udacity, I believe the insights gleaned there will serve well as a benchmark for this project.

Utilizing a very similar elbow method in that project, here is what the results were from there:

[Figure: Bertelsmann-Arvato K-Means SSE Scores]

As you can see, the diagram follows a very similar shape to ours from above, including how we might have opted to leverage nine clusters instead of four. In fact, in that project I did end up leveraging nine clusters, but that was partially because there was so much more data in that dataset. Given the relative size of this dataset, I still hold to utilizing four clusters, but this benchmark affirms that our methodology for determining the number of clusters is sound.

Now that we’ve gathered our various clusters from our hierarchical algorithm, let’s go ahead and visualize the results! Like we did in the Exploratory Data Analysis section, we’re going to explore two more high level questions and how these might be utilized by Starbucks to adjust their rewards program.

Q5: What personal attributes of our customers are defined throughout each of our clusters?

First, let’s take a look at the personal attributes of our customers as clustered by our algorithm. We’ll visualize this first in a PairPlot and then further down in some easier-to-read diagrams.

[Figure: Pair Plot of Personal Attributes to Clusters]

[Figure: Detailed Views of Personal Attributes to Clusters]

A5: Analysis of Clustered Personal Attributes

Lots of great insights here! Let’s cover each cluster in its respective section below.

Cluster 0

Easily the largest cluster, cluster 0 tended to consist of older people with higher incomes. As evidenced by our countplot with the age ranges, the ages of these folks typically fell into that 50 to 80 year old range. Additionally, it was somewhat common to see these folks have some of the higher income ranges. Gender was somewhat evenly split between males and females, and I would probably attribute the difference to the fact that the original dataset had more men to begin with, anyway. Admittedly, however, because this cluster was the biggest, it had a lot of discrepancies (particularly with income) that make it questionable how much to rely upon it for future inference.

Cluster 1

This cluster seems to be the young person’s cluster. Looking at the age range distribution, we see the strongest distribution here amongst the 20–40 year old crowd. Interestingly, the gap between males and females is quite large in this cluster. These folks also tend to fall toward the lower end when it comes to income. And as far as number of days as a member goes, this cluster’s distribution is very similar to that of cluster 0.

Cluster 2

This cluster here has a couple of interesting highlights. First, this is the only cluster where there were more females than males, and this cluster also contained the largest distribution of people who have been members of the Starbucks rewards program for some time. Income tended to be higher here, which is not surprising given that our EDA showed us that women tended to make more than men in this dataset.

Cluster 3

Finally, this last cluster is easily the smallest of our four clusters, yet it bears some striking similarities to cluster 0 in a few ways. Namely, the distribution of days as member and gender are fairly similar. Perhaps the exception that separates it from cluster 0 is that there tended to be more people in the younger age range, and I suppose that would make sense given that younger people consisted of a smaller subset of the original data provided.

Q6: What do the behavioral attributes look like across our clusters?

Finally, let’s wrap up our analysis by looking at the clusters across some of the more behavioral attributes.

[Figure: Pair Plot of Behavioral Attributes to Clusters]

[Figure: Detailed Visuals of Behavioral Attributes to Clusters]

A6: Analysis of Clustered Behavioral Attributes

Even more fascinating stuff in here. Let’s cover each cluster in its respective section below.

Cluster 0

Lots of interesting things to note in here. First, this cluster seems to contain the biggest spenders on average, with an average transaction total peaking at ~$19. I noted earlier on in the report that I felt that this amount is really high, but now that I think about it, maybe it’s on point.

Remember, this particular cluster contains our older customers, and perhaps these customers are buying for multiple people, like family members or friends. Considering that, perhaps that average number makes more sense.

Interestingly, this cluster has the lowest yield of using discount offers and is pretty close to last for BOGO offers. It’s also interesting to note that Starbucks almost caps out how many offers it sends to these folks. We have a lot of people receiving 2–3 offers, and that number falls off big time after that. Is it because Starbucks is tired of trying to appeal to a group that doesn’t take advantage of these offers? Not to stereotype older folks too much, but they do tend not to leverage technology as much. Is it because these offers are being managed electronically? More on this down in cluster 3.

Cluster 1

Recalling from above that this cluster tends toward a younger crowd, we see things like the average amount spent skew toward the lower end. Interestingly, we also see that this particular group gets hit hard with offers, spiking big time around 5 offers sent. Taking a closer look, it looks like there’s a stronger leaning toward BOGO offers over discount offers. In both cases, the success percentage is about equal, hovering around 40%. This number isn’t necessarily great, and it’s not clear why this is the case. Perhaps the difficulty levels for these rewards are too high for this cluster’s general demographics.

Cluster 2

Our predominantly female cluster, this group also has a high spend amount, much akin to cluster 0. This cluster is hit pretty strongly with a lot of offers, especially BOGO offers. I’m not sure why this is the case since the cluster doesn’t tend to react well to these offers, noting the lowest BOGO success percentage at about 25%. I think there’s definitely a lot of opportunity for improvement here given the demographic information of this group and the current low success rating across both offer types.

Cluster 3

This group is the most curious of all. Recall from the previous section that this group’s demographics closely aligned with cluster 0’s, with the exception that this group tends a little younger than cluster 0’s folks. I say that this group is curious because of how high a success rate this group has with offers, especially with discount-based offers. Now granted, this cluster is definitely the smallest of the bunch, but that doesn’t negate the fact that something is definitely working. Perhaps Starbucks could lean into learning more about this cluster to discover what is causing this level of success.

Before wrapping up our project here, I want to touch quickly on the fact that we could have easily chosen to go other routes when completing this project. Here are two very quick summarizations of what we could have done.

Supervised learning methods: While the dataset itself did not necessarily pose any right or wrong answers, we could have engineered some features that labelled our dataset as such based on things like whether an offer was completed or not. In my opinion, this seemed to be way too messy and disingenuous to be considered a viable route. Messy because it’s hard to classify success versus non-success, and disingenuous because it really relies upon the modeler’s best judgment to determine what success versus non-success even begins to look like. For those reasons, I steered away from supervised learning methods.

Deep learning: This is beyond my skillset, but I believe that there are neural networks that could have taken a lot of that feature engineering work off my hands to define optimal features for modeling from our dataset without my help. One Udacity reviewer who took a glance at a first draft of this project pointed me to this article about auto-encoders. Looking at the article, it looks like there is a loose coupling of these methods with the Principal Component Analysis (PCA) methods we learned about earlier on in the nanodegree. Again, deep learning like this was far out of scope for our learning purposes, so I am unable to speak to this more. (But I would eventually like to learn more!)

For as much work as I’ve put into this project, it would have been entirely feasible to take other approaches. There were certain pieces of data that were intentionally omitted that we just as easily could have used. We could have leveraged deep learning and allowed a model like that to determine the features for us. Those approaches are admittedly beyond my skill level at this point, but I still think we ended up with a lot of great insights in the end. I’m glad we took the approach that we did and am looking forward to applying these skills in other projects!
