Oh, the Places You’ll Go! Analyzing Chicago Divvy Bike Share Data
A modern city lives, breathes and eats data. Big data to be exact! Pick any urban center. You can’t walk one block without feeling the presence of major tech companies. Even the places where you least expect it are passively collecting massive amounts of data: your local coffee shops, street lights and even post offices. In fact, a great amount of this data is publicly available for you to download and analyze.
Another quality that most major cities have in common is traffic—rush hours are filled with walkers, bikers, taxis and buses. To ease congestion, the city of Chicago’s Department of Transportation created a bike-sharing service called Divvy. Key statistics are provided about each user’s ride, including trip duration, geolocation, start and stop times and much more. Let’s take a deeper look to see what we can find out using OmniSci!
Data Cleaning and PyMapD
A quick glance at the data reveals a great deal of useful information like User Type, Starting and Stopping Station Locations and BikeID. However, data pre-processing would unlock the full value of the data and correct some errors that are present in the raw download. For example, I noticed that the zip codes were incorrect in the Department of Transportation’s data. To fix this, I found a Python library that looks up zip codes from longitude and latitude (which were given correctly). I also grouped the zip codes into regions within the ChicagoLand Area (Downtown, Southside, Northside, Near Northside, Far Northside, and Western Suburbs).
Further data pre-processing revealed even more useful metrics. For example, Birth Year is one of the variables provided for a user. In my opinion, a more useful metric is the user’s age. In Python, I calculated the user’s age when they used bikes and also grouped the age into Generations (GenZ, Millennials, GenX, Baby Boomers, and Traditional). During this process, I noticed that some rows had missing data from a few of the variables, so I removed those. This left me with 8,646,054 rows and 31 columns of data to dig into.
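As a rough illustration of this step, here is a minimal sketch in plain Python. It is not the code from the repository: the column names `birthyear` and `start_year` are hypothetical, age is approximated as ride year minus birth year, and the generation cutoffs follow the ranges used later in this post. Two boundary years are assumptions on my part: 1996 is counted as Gen Z and 1945 as Traditional.

```python
def generation(birth_year: int) -> str:
    """Map a birth year to a generation label.

    Assumed boundaries: 1996 counts as GenZ, 1945 counts as Traditional.
    """
    if birth_year >= 1996:
        return "GenZ"
    if birth_year >= 1977:
        return "Millennials"
    if birth_year >= 1965:
        return "GenX"
    if birth_year >= 1946:
        return "Baby Boomers"
    return "Traditional"


def add_age_features(rows):
    """Drop rows with missing values, then add 'age' and 'generation'.

    `rows` is a list of dicts with hypothetical keys 'birthyear' and
    'start_year' (the year the ride took place).
    """
    cleaned = []
    for row in rows:
        # Skip any row that has a missing value in one of its columns.
        if any(value is None or value == "" for value in row.values()):
            continue
        birth_year = int(row["birthyear"])
        age = int(row["start_year"]) - birth_year  # approximate age at ride time
        cleaned.append({**row, "age": age, "generation": generation(birth_year)})
    return cleaned


if __name__ == "__main__":
    sample = [
        {"birthyear": 1990, "start_year": 2018},
        {"birthyear": None, "start_year": 2018},  # removed: missing birth year
    ]
    print(add_age_features(sample))
```

Because only the birth year is available, the computed age can be off by one depending on the rider's birthday.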
While at first I imported data directly into Immerse by compressing the files, I found that using the pymapd connector allowed me to insert data much faster. This made it significantly easier for me to create a new feature, instantly add it as a table and make new graphs. To see the specific details of the data, the full code is available at: https://github.com/sagoyal2/DivvyBikeShareData
Deep Dive into Divvy Data
To understand the Divvy Dataset, I needed to answer the same broad questions that define transportation data at large:
- Where are people going?
- Who is going?
- When are they going?
- Why are they going?
- How are they going?
The last answer is rather obvious for this case—the Divvy Bikes! The second to last answer is also somewhat obvious—people travel for either business (to go to and from work) or pleasure (to enjoy the cityscape).
No thanks, I think I’ll bike…
To start the analysis and figure out where people are going, I created a simple Geo Heatmap that measured the average trip duration at each bike station.
I expected to see longer trips at stations in the city than at those on its edge, because of workers arriving by train. The heatmap supports this: the average trip duration was greater closer to the heart of the city and decreased around the periphery. Notice in particular that stations along the coast of Lake Michigan have shorter average rides than those in the city. A plausible explanation is that most of the bike stations with higher average trip durations are used by commuters who arrive by train and bike to work in the morning.
To investigate this further I clustered each station by its location in the ChicagoLand Area and classified if the station was in the immediate vicinity of Union Station or Ogilvie Transportation Center (the two major transit stations that bring people into the city).
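One way to flag stations in the "immediate vicinity" of a transit hub is a simple distance cutoff. The sketch below is an illustration under stated assumptions, not the method used in the repository: the hub coordinates are approximate, and the 0.5 km radius is an arbitrary threshold chosen for the example.

```python
import math

# Approximate coordinates (assumed for illustration only).
TRANSIT_HUBS = {
    "Union Station": (41.8787, -87.6403),
    "Ogilvie Transportation Center": (41.8826, -87.6405),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometers."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def near_transit(lat, lon, radius_km=0.5):
    """Return True if a bike station lies within radius_km of any hub."""
    return any(
        haversine_km(lat, lon, hub_lat, hub_lon) <= radius_km
        for hub_lat, hub_lon in TRANSIT_HUBS.values()
    )


if __name__ == "__main__":
    print(near_transit(41.8790, -87.6400))  # a point right next to a hub
    print(near_transit(41.9484, -87.6553))  # a point several km to the north
```

The resulting boolean can be stored as an extra column and used as a crossfilter dimension alongside the region.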
The Downtown bike stations have a greater proportion of riders who use bike stations closer to major transit stops throughout the year.
OmniSci's crossfiltering capabilities helped me query and instantly render graphics that compared ChicagoLand regions and transit stops. The histogram on the left shows that when we consider only the Downtown area bike stations, a greater proportion of riders use bike stations close to major transit stops than in the ChicagoLand area as a whole. Also, the donut chart on the right indicates that the Downtown area has on average the longest trip duration (6.54 minutes), which matches the distribution I found on the heat map.
The donut chart also reveals that people in the Western Suburbs are on average the oldest users of Divvy bikes—indicated by the red color—with an average age of 39 years old. What else can we find out about a user’s age?
OMG, that’s so 2018...
People everywhere are finding ways to stay active and it seems like every week there is a new trend that catches on: hot yoga, climbing and biking just to name a few. This is especially common among younger generations. So why not start using Divvy to kill two birds with one stone? By biking to work, users can get their exercise in and finish their daily commute. Let’s break this down by generation.
Generations: Gen Z: >1996, Millennials: 1977-1995, Gen X: 1965-1976, Baby Boomers: 1945-1964, Traditional: <1946
The bubble chart on the left captures the sense of adventure that people of varying generations have. The bubble color shows the number of unique stops at various stations for each generation and the size of the bubble represents the number of records. As one might expect, the Traditional generation (those born between 1925 and 1945) is on average the least experimental when it comes to using Divvy bikes. They have both the lowest number of unique starting and stopping stations and a shorter average trip duration. Gen Z Divvy users, on the other hand, are willing to experiment and try many different starting and stopping locations, but tend to only take brief trips. There might be some truth in the idea that our attention spans are getting shorter!
Interestingly, Millennials and Gen X’ers are willing to use the Divvy bikes during peak commuting hours to travel both to and from work (Gen X’ers more than Millennials). Gen Z’ers mostly use the bikes when commuting to work in the morning, but not on the way back. Baby Boomers and Traditionals are not so keen on using them to commute to work at all.
Bike in Bulk?
Like many products, it saves money to buy in bulk, so why not do the same for bike passes?
The Downtown area has a greater proportion of Gen Z Customers (daily users) than the ChicagoLand Area at large.
Divvy calls its users with monthly passes Subscribers, and users with daily passes Customers. The figure on the left shows that for every generation there are more Subscribers than Customers. However, in the Downtown area there is a greater proportion of daily users among Gen Z. The data seems to show that younger people are more spontaneous when it comes to biking.
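The comparison above is about proportions rather than raw counts. A minimal sketch of that calculation, assuming each trip has been reduced to a hypothetical `(generation, user_type)` pair, could look like this. The sample trips are made up for the example and are not taken from the Divvy data.

```python
from collections import Counter


def customer_share(trips):
    """Return the fraction of trips made by Customers for each generation.

    `trips` is an iterable of (generation, user_type) pairs, where
    user_type is either "Subscriber" or "Customer".
    """
    totals = Counter()
    customers = Counter()
    for generation, user_type in trips:
        totals[generation] += 1
        if user_type == "Customer":
            customers[generation] += 1
    return {gen: customers[gen] / totals[gen] for gen in totals}


if __name__ == "__main__":
    # Toy data for illustration only.
    toy_trips = [
        ("GenZ", "Customer"),
        ("GenZ", "Subscriber"),
        ("GenX", "Subscriber"),
        ("GenX", "Subscriber"),
        ("GenX", "Customer"),
        ("GenX", "Subscriber"),
    ]
    print(customer_share(toy_trips))
```

Running the same function once on all trips and once on Downtown trips only gives the two sets of shares to compare.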
Park the Bike: Experience with OmniSci Immerse
Overall, getting started with the OmniSci Immerse platform was an engaging experience. This project in particular provided a real data analysis workflow, as it accounted for data cleaning and feature manipulation with the raw dataset.
Some key tips I’ve learned while using OmniSci:
- Bubble Chart requires a summary statistic for the axes, so since the Trip Duration was given in seconds, we can convert it to minutes and apply a summary statistic: avg(TRIPDURATION/60)
- If you expect to do data cleaning, consider using PyMapD as it will save a lot of time
- You can run SQL queries directly against your table using the SQL Editor before creating your own custom functions on the graph
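To make the first and last tips concrete, here is a small sketch of the kind of query one might try before building a chart. It uses Python's built-in `sqlite3` purely as a stand-in so the example runs anywhere; it is not OmniSci, and the table name `divvy_trips` and column `USERTYPE` are hypothetical. The duration column is declared as REAL so the division by 60 is not truncated to whole minutes.

```python
import sqlite3


def avg_trip_minutes_by_usertype(rows):
    """Return [(usertype, average trip duration in minutes), ...].

    `rows` is a list of (usertype, tripduration_in_seconds) tuples.
    """
    con = sqlite3.connect(":memory:")
    # Hypothetical table layout; REAL avoids integer division below.
    con.execute("CREATE TABLE divvy_trips (USERTYPE TEXT, TRIPDURATION REAL)")
    con.executemany("INSERT INTO divvy_trips VALUES (?, ?)", rows)
    query = """
        SELECT USERTYPE, AVG(TRIPDURATION / 60) AS avg_minutes
        FROM divvy_trips
        GROUP BY USERTYPE
        ORDER BY USERTYPE
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    # Toy durations in seconds, for illustration only.
    toy_rows = [("Subscriber", 600), ("Subscriber", 300), ("Customer", 1200)]
    print(avg_trip_minutes_by_usertype(toy_rows))
```

If the duration column is stored as an integer, it is worth checking whether the database truncates the division before trusting the averages.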
Try it for yourself today and see what can be done with OmniSci!