Analyzing AirBnB Data in the Lovely City of Seattle

David Hundley
6 min read · Aug 14, 2019

Hey there folks! I’m completing this post on behalf of my data science nanodegree for Udacity, so it might have a different feel from my regular set of posts. Still, I encourage you to stick around and check it out if this sort of thing interests you. You can also find my accompanying work for this post in this GitHub repository.

The world is a changing place, and it’s interesting how many of the top emerging companies are centered on industries with very few assets. Uber has few cars, Netflix has few physical DVDs anymore, and of course, we have AirBnB. As opposed to the big hotel chains with their massive amounts of real estate, AirBnB has successfully made a name for itself by serving as a broker between hosts willing to share their respective properties and guests looking for a nice getaway experience.

In this post, we’ll be looking at data related to AirBnB in the beautiful city of Seattle, Washington. There are lots of things we could have done to assess this data, but for brevity’s sake, we’ll focus ourselves on three particular areas. If you’d like to follow along with my work, be sure to check that out in my GitHub repository for this project.

Without further ado, let’s get into it!

Question #1: What are some of the features of places with the most reviews?

Okay, I’ll be honest: I’ve not actually used AirBnB myself, but I know from traditional hotel bookings that I like to look at the places with the most reviews and at what those reviews had to say. I don’t know if it’s necessarily the smartest approach, but I know many share my sentiment: there’s strength in numbers! If more people are staying at a given place, the hosts have to be doing something right, right? Below, I’ve picked out the top 5 places and displayed their attributes.
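For anyone following along, here is a minimal sketch of how a top-5 selection like this could be done with pandas. The tiny table and its column names (`number_of_reviews`, `host_name`, and so on) are made up for illustration and are only my assumption about how a listings table might look; the notebook in the repository may well do it differently.

```python
import pandas as pd

# Made-up stand-in for a listings table. Column names and values are
# assumptions for illustration, not data from the post.
listings = pd.DataFrame({
    "id": [101, 102, 103, 104, 105, 106, 107],
    "host_name": ["A", "A", "B", "C", "D", "E", "F"],
    "number_of_reviews": [474, 466, 404, 353, 320, 12, 0],
    "bedrooms": [1, 1, 2, 1, 1, 3, 2],
    "bathrooms": [1.0, 1.0, 1.0, 1.0, 1.5, 2.0, 1.0],
    "price": ["$75.00", "$66.00", "$109.00", "$89.00",
              "$95.00", "$250.00", "$120.00"],
})

# Prices stored as text (e.g. "$1,250.00") need the symbols stripped
# before they can be treated as numbers.
listings["price"] = pd.to_numeric(
    listings["price"].str.replace(r"[$,]", "", regex=True)
)

# Keep the five rows with the highest review counts.
top5 = listings.nlargest(5, "number_of_reviews")

columns = ["id", "host_name", "number_of_reviews",
           "bedrooms", "bathrooms", "price"]
print(top5[columns])
```

`nlargest` returns the rows already sorted from most to fewest reviews, which saves a separate sort-then-head step.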

Attributes of properties with the most reviews

The image above displays some of the attributes of the most reviewed properties, and you can see the rest of them within the Jupyter notebook in my linked GitHub repository. Now, it’s sort of hard to tell from the screenshot, but those first two entries are actually rooms within the same property, owned by some seemingly lovely folks named Dirk and Jaq. (More on this in a bit.) Looking at these top entries, most of them have about the same sorts of features. Things like…

  • Similar numbers of bedrooms and bathrooms
  • Relatively similar prices
  • Very similar scores

So… that wasn’t too informative, so let’s instead take a look at some of the reviews of the top 2 properties. Remember — those first two entries seem to fall in the same property owned by Dirk and Jaq, so we’ll look at the reviews for that property and a wholly different property.

Reviews for the top most reviewed property

Looking at a handful of the reviews from Jaq and Dirk’s property, I see lots of keywords related to location. Reviewer Rachel puts it succinctly in her quick review: “Good location.” Cleanliness and privacy also seem to be important factors across this sample of reviews.
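I eyeballed these themes by reading the reviews, but the same idea could be roughly automated. The sketch below counts how many reviews mention each theme at least once. Apart from the “Good location.” quote, the review snippets are invented, and the theme keyword lists and the `count_themes` helper are hypothetical names I made up for this example.

```python
import re
from collections import Counter

# Review snippets for one listing. Only the first is quoted in the post;
# the others are invented for illustration.
reviews = [
    "Good location.",
    "Very clean room and a great location near downtown.",
    "Private, clean, and quiet. Would stay again!",
]

# Hypothetical themes and the keywords that signal them.
themes = {
    "location": ("location", "downtown", "walk"),
    "cleanliness": ("clean", "tidy", "spotless"),
    "privacy": ("private", "privacy"),
}


def count_themes(texts, themes):
    """Count how many reviews mention each theme at least once."""
    counts = Counter()
    for text in texts:
        # Lowercase and split into whole words so matching is exact
        # ("clean" will not match "cleanliness").
        words = set(re.findall(r"[a-z']+", text.lower()))
        for theme, keywords in themes.items():
            if any(keyword in words for keyword in keywords):
                counts[theme] += 1
    return counts


theme_counts = count_themes(reviews, themes)
print(theme_counts.most_common())
```

Whole-word matching is crude (it misses variants and misspellings), so treat the counts as a rough guide rather than a proper text analysis.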

Now, let’s take a look at the next property.

Reviews for second most reviewed property

So this property is interesting to compare to the first one since this property is a house and not a loft. The most common compliment I see in these reviews has to do with the hosts, Amanda and Luisa. Of course, location is also an important factor, but what interests me most is the “family friendly” notes I see in here. I find that particularly interesting because the datasets we worked with did not indicate family-friendliness at all, and as a dad to two little girls, I know that would definitely be an important factor when looking to stay at a property. I’m wondering if AirBnB does capture this information elsewhere, and if not, it may be in their best interest to look at adding this as an established field for hosts to indicate.

Question #2: What is the average rental price each month?

Many people like to travel in off-peak seasons when prices are lower, and before I even started to look at the data, I made the immediate assumption that my findings would show summer and holiday months peaking above the others. (Because, of course, these are the times of year when schools generally break and/or the weather is most accommodating.)

Let’s see what the data has to share with us.

Average Rental Price Each Month in 2016

Surprise, surprise… my assumption was pretty much on point! Granted, there isn’t a huge swing between peak months and off-peak months, but there is still a noticeable increase in that summer timeframe (June-August) and a slight bump again in December.
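If you want to try a monthly average like this yourself, here is a minimal pandas sketch. The rows are made up, and the column names (`listing_id`, `date`, `price`) are my assumption about a per-day calendar table; prices are treated as text with a dollar sign, and a missing price stands in for a night with no listed rate.

```python
import pandas as pd

# Made-up stand-in for a per-day calendar table. Column names and values
# are assumptions for illustration only.
calendar = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2, 2],
    "date": ["2016-01-15", "2016-07-15", "2016-12-15",
             "2016-01-15", "2016-07-15", "2016-12-15"],
    "price": ["$100.00", "$140.00", "$120.00",
              "$80.00", None, "$100.00"],
})

# Parse the dates and pull out the month number (1 through 12).
calendar["date"] = pd.to_datetime(calendar["date"])
calendar["month"] = calendar["date"].dt.month

# Strip "$" and "," so the price text can be converted to numbers.
# Missing prices stay missing (NaN).
calendar["price"] = pd.to_numeric(
    calendar["price"].str.replace(r"[$,]", "", regex=True)
)

# Average price per month; mean() skips missing prices by default.
monthly_avg = calendar.groupby("month")["price"].mean()
print(monthly_avg)
```

One design choice to be aware of: averaging over every calendar row weights each listing by how many nights it has a price, so listings with more priced nights count for more than the others.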

Question #3: Can we predict overall rating based on other factors about the property?

This one is going to get a little bit dicey, but we’re going to try it out anyway. One of the attributes found in the AirBnB datasets is an overall rating score from 1 to 10 on a given rental property. I’m curious if we can use a large chunk of our dataset to correctly guess the score of the other part of the dataset using some machine learning magic. Specifically, we’ll be using three different supervised learning methods and comparing the results of each to see how well they did.

Given that the rating is a continuous target, we’ll run our data through three different regression models and calculate the R-squared value of each: Linear Regression, Support Vector Regression (SVR), and AdaBoost Regressor. I chose these three for the simple fact that they get progressively more complex, so I’m interested to see how the R-squared value fares as model complexity increases.
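To make the comparison concrete, here is a minimal scikit-learn sketch of fitting the three models and scoring each with R-squared on a held-out split. The data is synthetic (random features with a linear signal plus noise), and the default hyperparameters, the 70/30 split, and the random seeds are all my assumptions for this example rather than the settings used in the notebook.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data: 300 rows, 4 numeric features, and a target
# built from a linear signal plus noise. Not the AirBnB data.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)

# Hold out 30% of the rows to score the models on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The three model families named above, with default settings (assumed).
models = {
    "Linear Regression": LinearRegression(),
    "SVR": SVR(),
    "AdaBoost Regressor": AdaBoostRegressor(random_state=42),
}

# Fit each model on the training rows and compute R-squared on the test rows.
r2_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    r2_scores[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R^2 = {r2_scores[name]:.3f}")
```

As a reading guide for the scores: an R-squared of 1 means perfect predictions, 0 means the model does no better than always predicting the mean of the test targets, and negative values mean it does worse than that.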

And after running our dataset through these models, here’s what we came up with:

R-Squared Scores

Yeah… pretty bad.

It’s interesting to note how closely the basic Linear Regression model performed to the SVR model, but as the chart makes obvious, we saw a steep drop-off with the AdaBoost Regressor model.

I’m personally not all that surprised. Much of the data in that particular set was very similar, and given that we stripped out a lot of the neighborhood information (due to compute power issues), we basically asked the models to dance with very few instructions.

Conclusion

Phew! I know this post wasn’t very long, but if you go and check out the associated work to get to this point, you’ll see I spent a LOT of time making these simple analyses happen. Data manipulation and exploration can be fun, but it’s also a lot of work. Still, we got to learn some fun things about the AirBnB scene in Seattle, so it’s not particularly like this project was a bog.

Anyway, please be sure to check out my other posts! I use this Medium channel to write about philosophical matters as well as artificial intelligence in general. (I’ve tried making it easier for folks by making the title cards lighter for more business / AI oriented content and darker for the philosophical content.) Thanks for checking out this post, and happy travels!

Written by David Hundley

Principal machine learning engineer at a Fortune 50 company, 5x AWS certified, 2x HashiCorp certified, 1x GCP certified, M.A. in Org Leadership, PMP, ChFC, CSM
