Part I - Ford GoBike System Slideshow San Francisco Bay Area (February 2019)

by Oke Oladunsi

Table of Contents

Introduction

Bay Wheels (previously known as Ford GoBike) is a regional public bike sharing system in the San Francisco Bay Area, California. It is the first regional, large-scale bicycle sharing system deployed in California and on the West Coast of the United States. It recorded nearly 500,000 rides after its launch in 2017 and had about 10,000 annual subscribers as of January 2018.

The dataset used for this exploratory analysis contains information about individual rides made in a bike-sharing system covering the greater San Francisco Bay Area. It covers February 2019 only and is provided in CSV format. Data for other cities is also available.

Preliminary Wrangling

This data needs a little wrangling, so I will change the data types of some variables (features) into more appropriate ones using the astype method.
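As a rough illustration of the kind of conversion described above, the sketch below applies astype to a tiny made-up sample. The column names start_time, user_type and bike_id are assumptions (only member_birth_year is named in this report), and the target types are one reasonable choice rather than the exact ones used in the notebook.

```python
import pandas as pd

# Tiny made-up sample; column names other than member_birth_year are assumed.
rides = pd.DataFrame({
    "start_time": ["2019-02-28 17:32:10", "2019-02-28 18:53:21"],
    "user_type": ["Customer", "Subscriber"],
    "member_birth_year": [1984.0, 1972.0],
    "bike_id": [4902, 2535],
})

# Convert each column to a type that matches what it represents.
rides = rides.astype({
    "start_time": "datetime64[ns]",   # timestamps instead of plain strings
    "user_type": "category",          # small fixed set of labels
    "member_birth_year": "Int64",     # nullable integer, tolerates missing years
})

print(rides.dtypes)
```

Using the nullable Int64 type keeps birth years as whole numbers even when some rows have no value.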

Gathering

Assessing the data

Cleaning

Checking if there are duplicated rows
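A duplicate check of this kind might look like the sketch below. The sample rows are made up, and bike_id is an assumed column name.

```python
import pandas as pd

# Made-up sample in which the third row repeats the first one exactly.
rides = pd.DataFrame({
    "bike_id": [4902, 2535, 4902],
    "duration_sec": [520, 610, 520],
})

# duplicated() flags every row that repeats an earlier row in full.
n_duplicates = int(rides.duplicated().sum())
print(f"duplicated rows: {n_duplicates}")

# If any are found, they can be dropped while keeping the first occurrence.
rides = rides.drop_duplicates().reset_index(drop=True)
```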

Checking the amount of rows with empty values
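One way to count missing values per column and per row is sketched below on made-up data. The column member_gender is an assumed name, and dropping incomplete rows is shown only as one possible treatment.

```python
import numpy as np
import pandas as pd

# Made-up sample with one incomplete row.
rides = pd.DataFrame({
    "duration_sec": [520, 610, 300],
    "member_birth_year": [1984.0, np.nan, 1990.0],
    "member_gender": ["Male", None, "Female"],
})

# Missing entries per column.
null_counts = rides.isna().sum()
print(null_counts)

# Number of rows that have at least one missing entry.
rows_with_nulls = int(rides.isna().any(axis=1).sum())

# One possible treatment (an assumption here): drop the incomplete rows.
clean = rides.dropna()
```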

Feature Engineering

Creating new columns that will be needed for the visualizations

Code

More features are created

Code

NB: The essence of the feature engineering done above is to provide insights into how to make the best use of both human and non-human resources, increasing profit while still providing a premium bike sharing service.
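For readers without the code cells, the sketch below shows how features of this kind could be derived with pandas. Only duration_sec, member_birth_year and travel_minutes are names taken from this report; start_time, start_hour, start_day and age are hypothetical names used for illustration, and the sample rows are made up.

```python
import pandas as pd

# Made-up sample; start_time is an assumed column name.
rides = pd.DataFrame({
    "start_time": pd.to_datetime(["2019-02-01 08:15:00", "2019-02-02 13:40:00"]),
    "duration_sec": [480, 1500],
    "member_birth_year": [1990, 1955],
})

# Trip duration in minutes is easier to read than seconds.
rides["travel_minutes"] = rides["duration_sec"] / 60

# Hour of day and weekday name of the trip start (hypothetical column names).
rides["start_hour"] = rides["start_time"].dt.hour
rides["start_day"] = rides["start_time"].dt.day_name()

# Rider age in the year the data was recorded (hypothetical column name).
rides["age"] = 2019 - rides["member_birth_year"]

print(rides)
```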

Test

Code

Code

Test

What is the structure of your dataset?

The dataset initially consists of 183412 rows and 16 columns, six of which contain null values (see the cell above for the number of null entries per feature). After wrangling, it is reduced to 174952 rows and the number of columns increases to 21. The dataset covers rides in February 2019.

What is/are the main feature(s) of interest in your dataset?

I'm most interested in exploring bike trip durations and the patterns in when rentals occur, along with how these relate to the riders' characteristics (user type, gender, age, etc.), to get a sense of how and for what people use the bike sharing service. Sample questions to answer: When are most trips taken in terms of time of day, day of the week, or month of the year? How long does the average trip take? Does the above depend on whether a user is a subscriber or a customer?

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

member_birth_year, start_station_latitude, start_station_longitude, end_station_latitude and end_station_longitude are the features I think I will need for these visualizations, though I will probably create new ones to support the ones I currently have.

Univariate Exploration

Checking how correlated the numeric columns are with each other
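A correlation check like this can be sketched as follows. The data is made up, and age and user_type are assumed column names.

```python
import pandas as pd

# Made-up sample with two numeric columns and one text column.
rides = pd.DataFrame({
    "duration_sec": [300, 600, 900, 1200],
    "age": [25, 30, 35, 40],
    "user_type": ["Subscriber", "Customer", "Subscriber", "Customer"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = rides.select_dtypes(include="number").corr()
print(corr)
```

The resulting matrix can be passed to a heatmap function for plotting.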

What is the distribution of travel time and of the ages of riders?

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

The duration_sec variable is not normally distributed; it is heavily skewed. I therefore created a new feature, travel_minutes, to observe it more closely. The plots also show that the majority of members did not use bike share for all of their trips, and that most riders were around 25 to 40 years old. Most rides were short, lasting between 5 and 10 minutes, although some riders' travel time is close to 23 hours, which is unusual.
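The skew described above can be quantified. The sketch below uses a made-up set of durations, including one trip of about 23 hours, and compares the skewness before and after a log10 transform. The log transform is a common way to inspect long-tailed data and is not necessarily what the notebook used.

```python
import numpy as np
import pandas as pd

# Made-up durations in seconds; the last value is a trip of about 23 hours.
duration_sec = pd.Series([240, 300, 360, 420, 480, 540, 600, 900, 3600, 82800])
travel_minutes = duration_sec / 60

# Sample skewness: values well above 0 indicate a long right tail.
raw_skew = travel_minutes.skew()

# A log10 transform compresses the tail and makes the bulk easier to see.
log_skew = np.log10(travel_minutes).skew()

print(f"skew (minutes): {raw_skew:.2f}, skew (log10 minutes): {log_skew:.2f}")
```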

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes. travelTime_hour shows that 99% of rides are within 1 hour, and about 91% of these rides are within 20 minutes. Some features have unusual distributions, such as rider age, which includes a value of 141 years that seems implausible.
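Shares like the ones quoted above can be computed as the mean of a boolean mask. The sketch below uses made-up values, so its percentages do not match the report. The age column name and the cutoff of 100 years are assumptions for illustration.

```python
import pandas as pd

# Made-up sample; one rider has an implausible age of 141.
rides = pd.DataFrame({
    "travel_minutes": [5, 8, 12, 18, 25, 45, 70, 9, 6, 15],
    "age": [25, 31, 141, 38, 29, 45, 33, 27, 52, 36],
})

# The mean of a boolean mask is the fraction of rows that satisfy it.
share_within_hour = (rides["travel_minutes"] <= 60).mean()
share_within_20 = (rides["travel_minutes"] <= 20).mean()
print(f"within 1 h: {share_within_hour:.0%}, within 20 min: {share_within_20:.0%}")

# One possible way to handle implausible ages (cutoff is an assumption).
plausible = rides[rides["age"] <= 100]
```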

Bivariate Exploration

In this section, investigate relationships between pairs of variables in your data. Make sure the variables that you cover here have been introduced in some fashion in the previous section (univariate exploration).

NB: The chart above covers only rides of up to one hour (1 h).


Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

There is a lot more subscriber usage than customer usage. Riding habits vary considerably between subscribers and customers.
Subscribers use the bike sharing system for their work commute, so most of their trips were on work days (Mon-Fri), especially during rush hours
(going to work in the morning around 8am and leaving work in the evening around 5pm).

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There is a clear difference in the average travel time of customers (13 mins) and subscribers (8 mins). It seems subscribers tend to know where they are going, which reduces their travel time.

From Monday through Thursday, travel times are within 10 mins, but on Friday and Saturday the average goes above 10 mins and occasionally above 15 mins, though not beyond 17 mins. During the weekends, however, some rides go beyond 17 mins, into the 20-minute range.
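Averages of this kind come from a group-by. The sketch below uses made-up rows, so the numbers it prints are illustrative and are not the report's results. start_day is a hypothetical column name.

```python
import pandas as pd

# Made-up sample; values are illustrative only.
rides = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Customer", "Customer"],
    "start_day": ["Monday", "Saturday", "Monday", "Saturday"],
    "travel_minutes": [7.0, 9.0, 11.0, 15.0],
})

# Average trip length per user type.
mean_by_user = rides.groupby("user_type")["travel_minutes"].mean()

# Average trip length per weekday, listed in calendar order.
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
mean_by_day = (
    rides.groupby("start_day")["travel_minutes"].mean()
    .reindex(day_order)   # calendar order instead of alphabetical
    .dropna()             # days absent from this small sample
)

print(mean_by_user)
print(mean_by_day)
```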

Multivariate Exploration

Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.

So far we have been looking at the entire dataset. But how do younger riders' riding patterns differ from those of older riders?

NB: The chart of the youth riding pattern looks the same as that of the general dataset.
NB: Hypothesis: older riders should show the reverse of the trend we have been noticing.
Looking at their riding pattern by gender:
The chart above prompts a closer look at how these riders use the service during the weekend, using the hourly and period features for riders over 60.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Users aged 18-40 spend about 10 mins per ride on average, for both customers and subscribers, across the 24 hours of the day.

  • On the weekly chart, the youth trend is similar to that of the whole dataset: both subscribers and customers spend more time riding during the weekends.
  • Throughout February, young subscribers spend about 10 mins riding to their destinations, but customers' travel time is not constant.
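A comparison across three variables (age group, user type, part of the week) can be laid out with a pivot table. The sketch below is on made-up data. The column names age and day_type are hypothetical, and the 18-40 range follows the text above.

```python
import pandas as pd

# Made-up sample; values are illustrative only.
rides = pd.DataFrame({
    "age": [22, 35, 28, 64, 39, 19],
    "user_type": ["Subscriber", "Customer", "Subscriber",
                  "Subscriber", "Customer", "Customer"],
    "day_type": ["weekday", "weekday", "weekend",
                 "weekend", "weekend", "weekday"],
    "travel_minutes": [8.0, 12.0, 10.0, 9.0, 16.0, 14.0],
})

# Keep riders aged 18 to 40 inclusive.
young = rides[rides["age"].between(18, 40)]

# Mean trip length for each combination of day type and user type.
table = young.pivot_table(
    index="day_type",
    columns="user_type",
    values="travel_minutes",
    aggfunc="mean",
)
print(table)
```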

Were there any interesting or surprising interactions between features?

For riders aged 60 and above, usage between 6am and 11pm looks roughly normally distributed when broken down by gender, for all genders. By user type the distribution also looks roughly normal, yet there are times when customers don't use the service at all, such as 6am-7am and from 8pm until 4am.

  • Usage of the service by older riders peaked at 1pm for subscribers and at 10am for customers.
  • Broken down by gender, the peak is at 2pm for all genders.
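Peak hours like these can be read off a table of ride counts per hour. In the sketch below the rows are made up and arranged only to illustrate the mechanics. age and start_hour are hypothetical column names.

```python
import pandas as pd

# Made-up sample; values are illustrative only.
rides = pd.DataFrame({
    "age": [65, 70, 62, 61, 68, 30, 66],
    "user_type": ["Subscriber", "Subscriber", "Customer", "Customer",
                  "Subscriber", "Subscriber", "Customer"],
    "start_hour": [13, 13, 10, 10, 9, 8, 15],
})

# Keep riders older than 60.
older = rides[rides["age"] > 60]

# Rides per user type (rows) and start hour (columns); absent pairs count as 0.
counts = older.groupby(["user_type", "start_hour"]).size().unstack(fill_value=0)

# The busiest hour for each user type is the column holding the row maximum.
peak_hour = counts.idxmax(axis=1)
print(peak_hour)
```

The same pattern works for gender by swapping the grouping column.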

Conclusions

With the above insights and plots, we can see large variations in ride duration for customers, but this is not the case for subscribers.
My interpretation is that customers tend to be tourists or visitors who show up during the day to get around town, so their rides tend to be longer and mostly happen around noon. Subscribers are more likely to live in the local area and commute daily according to their work schedules. That is why there is no massive peak in their trip duration, and the durations remain consistent around the commute hours (8am and 5pm) for both young and older riders.

We can also deduce that most riders don't like to share their rides, probably because they need to get to their destinations on time.

Ride durations also tend to increase at weekends.

Limitations

The dataset is sufficient to build a machine learning solution that predicts a user's travel_time from station to station. There are other possibilities, but I will limit myself to the one stated.

References