EDA from Scratch: Hands on Guide

#eda #matplotlib

Sanket Kangle Sept 18 2020 · 6 min read
Share this

I firmly believe in learning by doing. In this article we will solve Bike Sharing Demand problem from scratch with code explanation, So hang tight for the ride.

  Photo by asoggetti on Unsplash


Problem Statement

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able rent a bike from a one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. Our aim here is to combine historical usage patterns with weather data and do exploratory data analysis of the Capital Bikeshare program in Washington, D.C.

Data Loading and Understanding

You can use jupyter notebook/google colab or any other interactive .ipynb supported program to follow this exercise.

First things first, let’s import the necessary libraries

import pandas as pd
import numpy as np
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns
# to show the figures in the jupyter notebook itself
%matplotlib inline

If you get error saying No module named ‘XYZ’. install that package by following command

pip install XYZ

You can download dataset form here or simply run following command

df = pd.read_csv(
'https://raw.githubusercontent.com/sanketkangle/Exploratory-Data-Analysis/master/Bike-Sharing-problem/bikes_sharing_problem_dataset.csv',
parse_dates=['datetime'])

pd.read_csv function reads a CSV(comma separated values) file and converts it into a dataframe and stores in variable called df.
parse_dates converts datetime column from object dtype to datetime64 dtype. It will help us to manipulate and analyse datetime column in future.

To see the head of dataframe, use following command

df.head()

You will get result like this

Image from author

To look at the tail of dataframe, use following command

df.tail()

To know number of records and columns, you can use following command

df.shape

Image from author

Dataframe has 10886 records and 12 columns

Now, let us understand the dataset: The data (hourly rental data spanning two years)is comprised of the first 19 days of each month. the coulmns/features are as below

  • datetime — hourly date + timestamp
  • season — 1 = spring, 2 = summer, 3 = fall, 4 = winter
  • holiday — whether the day is considered a holiday
  • workingday — whether the day is neither a weekend nor holiday
  • weather — 1: Clear, Few clouds, Partly cloudy, Partly cloudy
    2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
  • temp — temperature in Celsius
  • atemp — “feels like” temperature in Celsius
  • humidity — relative humidity
  • windspeed — wind speed
  • casual — number of non-registered user rentals initiated
  • registered — number of registered user rentals initiated
  • count — number of total rentals
  • df.info()
    Image from author

    Dataset consists of total 10,886 records and Dtype for each column is as above. Columns season, holiday, workingday and weather are originally categorical variables, but they are already encoded in numerical values.

    Let us check for null values in our dataset

    msno.matrix(df)
    Image from author

    As seen from above graph, we do not have any missing values. alternate way to check missing values is as follows

    df.isna().value_counts()
    Image from author

    False under each column name signifies there is no missing value in that column.

    EDA

    The feature count gives total number of rental at particular time. Let’s see how count is varied w.r.t. other few features.

    Following command plots a scatter plot with season on x-axis and count on y-axis

    df.plot.scatter(x='season', y='count')
    Image from author

    From this plot we understand that for Fall season, the people used rental bike the most and in spring the least.

    Following command plots a scatter plot with holiday on x-axis and count on y-axis

    df.plot.scatter(x="holiday", y="count")

    Image from author

    It is clear from this plot that on holiday, people use less renal bikes. The cause for this might be that a lot of them using bikes to go to office/work and on holiday they are not using it.

    Following command plots a scatter plot with workingday on x-axis and count on y-axis

    df.plot.scatter(x="workingday", y="count")
    Image from author

    What we suspected in last plot is confirmed in this one. on workingday people used more rental bikes.

    Following command plots a scatter plot with weather on x-axis and count on y-axis

    df.plot.scatter(x="weather", y='count')
    Image from author

    Recall weather column data fields:

      weather 1: Clear, Few clouds, Partly cloudy, Partly cloudy2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog  

    As weather is getting clear to stormy, number of bike users decreased. Also the frequency for ‘4’ is low(only one data point) as such weather is also rare.

    Following command plots a scatter plot with temperature on x-axis and count on y-axis

    df.plot.scatter(x="temp", y="count")
    Image from author

    For extreme low temperature there are very few who used bikes and as temperature increases, the count also increases.

    Following command plots a scatter plot with humidity on x-axis and count on y-axis

    df.plot.scatter(x="humidity", y="count")
    Image from author

    for extremely low and high humidity less bikes are in use which makes sense as they are corresponding to adverse weather conditions.

    Following command plots a scatter plot with windspeed on x-axis and count on y-axis

    df.plot.scatter(x="windspeed", y="count")

    As wind speed increased, the count decreases gradually which is anticipated as again with high wind speed associated with bad weather.

    In conclusion till now, we understood two important points:
    1. Many people rent bike for traveling to office and home
    2. Bad weather conditions have adverse effect on bike rentals

    Now, let’s focus on how the count varies during the day and how other features affect it. We have to first get hour of the day from date time column, and then work with it.

    Below is a function that will return a bar plot of count based on hour for a given year

    def plot_by_hour(data, year=None, agg='sum'):
    dd = data #flag 1
    if year : #flag 2
    dd = dd[dd.datetime.dt.year == year]
    dd.loc[:, ('hour')] = dd.datetime.dt.hour #flag 3
    by_hour = dd.groupby(['hour', 'workingday'])['count'].agg(agg).unstack() #flag 4
    return by_hour.plot(kind='bar', title="year = {0}".format(year),figsize=(15,5), width=0.9) #flag 5

    We created a function called plot_by_hour which takes input one dataframe(data) and year.

  • #flag 1: we assign complete dataframe(both years 2011 & 2012) to dd
  • #flag 2: if year is provided as an argument to the function, then we slice the dataframe dd only for that year
  • #flag 3:creating a column called hour in dataframe dd and populating it with hour from datetime column
  • #flag 4: From dataframe dd, we are grouping the data by hour and workingday columns and then getting its count column and assigning it to a variable called by_hour
  • #flag 5: Returning the bar plot
  • Now, let’s plot the graph for individual years and combined

    plot_by_hour(df, year=2011)
    plot_by_hour(df, year=2012)
    plot_by_hour(df)
    Image from author
    Image from author
    Image from author

    From above plots, it is clear that working professionals do use rental bikes for commuting to office and home, as there are consistence spikes at rush hour(office starting and ending hours) on working day which are not there on holidays. Also, in night time, the use of rental bike is minimal

    The use of rental bikes had increases in 2012 than in 2011, let’s focus month-wise and hour-wise increase of rental bikes in 2012 than in 2011. Following is function to do the same

    def plot_by_year(data,col, title):
    dd = data.copy() #flag 1
    dd['year'] = data.datetime.dt.year #flag 2
    dd['month'] = data.datetime.dt.month #flag 3
    dd['hour'] = data.datetime.dt.hour #flag 4
    by_year = dd.groupby([col, 'year'])["count"].agg('sum').unstack() #flag 5
    return by_year.plot(kind='bar', figsize=(15,5), width=0.9, title=title) #flag 6

    We created a function plot_by_year which takes input dataframe(data), name of column(col) and title for plot(title)

  • #flag 1 : creating copy of dataframe so as to not disturb original dataset
  • #flat 2 : creating column named year in dataframe dd and populating with year extracted from datetime column of data
  • #flag 3 :creating column named month in dataframe dd and populating with month extracted from datetime column of data
  • #flag 4 : creating column named hour in dataframe dd and populating with hour extracted from datetime column of data
  • #flag 5 : grouping dataframe dd on columns col(given as argument to function), year and aggregating their sum and storing it in variable by_year
  • #flag 6 : returning the plot
  • plot_by_year(df,'month', "Rent bikes per month in year 2011 and 2012")
    Image from author

    From above graph, it is clear that, in every month, 2012 has more bikes rented than 2011. Let’s see, does by hour also 2012 is leading or not

    plot_by_year(df, 'hour', 'Rent bikes per hour in year 2011 and 2012')
    Image from author

    Yes! hourly also 2012 leads. This means in 2012, the rental got more popular i.e. marketing strategy in 2011 was good that is why count increased in 2012.

    Let’s count the percentage growth

    count_2011 = df[df.year==2011]['count'].agg('sum')
    count_2012 = df[df.year==2012]['count'].agg('sum')
    percentage_growth = (count_2012-count_2011)*100/count_2011

    Hence the percentage growth come around 66.69%

    Conclusion: (in addition to last conclusions)
    3. In office starting and ending hours, the demand for rental bikes is more as many working professional use it as a way of commute
    4. In night time, the demand is very low
    5. Marketing strategy used in 2011 must be efficient, as count has increased in 2012 by margin of 66.69%

    You can get the complete code for this exercise here.

    * * *

    Thanks for reading the article! Wanna connect with me?
    Here is link to my Linkedin Profile

    Comments
    Read next