Data manipulation with Dask dataframe

#datascience #machinelearning #dataanalysis

AMAN JAISWAL Sept 23 2020 · 4 min read

Dask: An Introduction

Dask is an open-source library for parallel computing written in Python. Originally developed by Matthew Rocklin, Dask is a community project maintained and sponsored by developers and organizations. Check here for more info.

Why Dask?

  • The ability to work in parallel with NumPy arrays and Pandas DataFrame objects
  • Integration with other projects
  • Distributed computing
  • Faster operation because of its low overhead and minimal serialisation
  • Runs resiliently on clusters with thousands of cores
  • Can handle big files that do not fit in memory

Overview

In this blog we will learn to manipulate the US Accidents dataset (A Countrywide Traffic Accident Dataset, 2016-20) with a Dask DataFrame. This is a huge dataset with more than 3.5 million rows and 49 columns, and its size on disk is 1.24 gigabytes.

    Dask dataframe

A Dask DataFrame is a large parallel DataFrame composed of many smaller Pandas DataFrames, split along the index. These Pandas DataFrames may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster. One Dask DataFrame operation triggers many operations on the constituent Pandas DataFrames. For more info check this.
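
As a minimal illustration of this idea (a sketch, not part of the original post's workflow), a small Pandas DataFrame can be split into partitions with dd.from_pandas:

    import pandas as pd
    import dask.dataframe as dd

    # A tiny Pandas DataFrame, purely for illustration
    pdf = pd.DataFrame({'x': range(10), 'y': range(10, 20)})

    # Split it along the index into 3 smaller Pandas DataFrames (partitions)
    ddf = dd.from_pandas(pdf, npartitions=3)

    ddf.npartitions                        # 3
    ddf.map_partitions(len).compute()      # length of each constituent Pandas DataFrame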

    import dask.dataframe as dd
    import dask.array as da
    import pandas as pd
    import numpy as np

    dask_df = dd.read_csv('/kaggle/input/us-accidents/US_Accidents_June20.csv')

As mentioned, a Dask DataFrame contains many smaller Pandas DataFrames. Let's verify that.

    dask_df.map_partitions(type).compute()

    Checking length of each partition

    dask_df.map_partitions(len).compute()

By default there are 21 partitions in total, of different lengths.
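
The partition count comes from how read_csv splits the file into blocks. Here is a minimal sketch, assuming the same file path, of controlling it via the blocksize argument (the 32MB value is only an illustrative choice):

    # Smaller blocks produce more, smaller partitions
    dask_df_blocks = dd.read_csv('/kaggle/input/us-accidents/US_Accidents_June20.csv',
                                 blocksize='32MB')
    dask_df_blocks.npartitions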

Accessing one of the partitions

    dask_df.partitions[1].compute()

Let's explore the dataset with the Dask DataFrame.

    Getting top 5 rows of the dataframe

    dask_df.head()

    Getting last 5 rows

    dask_df.tail()

Getting the five-number summary along with the mean and count.

    dask_df.describe().compute()

Setting the ID column as the index

    dask_df = dask_df.set_index('ID')
    dask_df

Dask does lazy computation. Here we only get the skeleton of the Dask DataFrame, in which the index is ID.
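
To see the laziness in action, here is a small sketch using the Severity column: an operation only builds a task graph, and nothing is calculated until .compute() is called.

    # Building the expression is cheap; no data is read yet
    mean_severity = dask_df['Severity'].mean()
    mean_severity                 # a lazy Dask scalar, not a number

    # .compute() runs the task graph over all partitions and returns the result
    mean_severity.compute()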

Appending a new column

Severity shows the impact of the accident on traffic duration. So let's find the cases in which the traffic delay is longest. We append a new column named 'long_delay' which is True when the traffic delay is longest; in Severity, the longest delay is denoted by the number 4.

    dask_df['long_delay'] = dask_df['Severity']==4

    dask_df.head()                  # Checking for new column

At the right of the table, the long_delay column has been appended.

    Changing datatypes of columns

    dask_df['long_delay'].dtype # output: bool
    dask_df['long_delay'] = dask_df['long_delay'].astype('int')
    dask_df['long_delay'].dtype # output: int64

The data type of the long_delay column changes from bool to int64.
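
Column dtypes can also be fixed up front when reading the file; here is a minimal sketch, assuming the same CSV path (the chosen columns and dtypes are only illustrative):

    # Tell read_csv the expected dtypes instead of letting Dask infer them per block
    dask_df_typed = dd.read_csv('/kaggle/input/us-accidents/US_Accidents_June20.csv',
                                dtype={'Severity': 'int64', 'Zipcode': 'object'})
    dask_df_typed.dtypes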

Getting unique elements

    dask_df['Severity'].unique().compute()

    '''
    Output:
    0 3
    1 2
    2 4
    3 1
    '''

The above output shows that there are four unique elements in Severity: 3, 2, 4 and 1.

    Count of each unique element

    dask_df['Severity'].value_counts().compute()

    '''
    Output:
    2 2373210
    3 998913
    4 112320
    1 29174
    '''

Accessing the data of the dataframe

1. Accessing a particular row

    dask_df.loc['A-3'].compute()

2. Accessing a particular column

    dask_df.loc[:,'City'].compute()

3. Accessing a particular row and column

    dask_df.loc['A-100','State'].compute()

    '''
    Output:
    ID
    A-100   OH
Name: State, dtype: object
    '''

4. Accessing a range of rows

dask_df.loc['A-5':'A-9']    # lazy; add .compute() to materialise the selected rows

    Conditional search

Searching for rows where the latitude equals 39.865147

    dask_df[dask_df['Start_Lat']== 39.865147].compute()

Multiple conditional search

Searching for rows matching both a longitude and a latitude

    dask_df[da.logical_and(dask_df['Start_Lng']==-86.779770,dask_df['Start_Lat']== 36.194839)].compute()
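
The same filter can also be written with the Pandas-style boolean operators, which is the more common idiom; a small sketch:

    # Combine the two conditions with &; each condition needs its own parentheses
    mask = (dask_df['Start_Lng'] == -86.779770) & (dask_df['Start_Lat'] == 36.194839)
    dask_df[mask].compute()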

Getting the number of null values in each column

    dask_df.isna().sum(axis=0).compute()

    Filling Null values

Calculating the number of null values in the Wind_Speed(mph) column, which is equal to 454609.

    dask_df['Wind_Speed(mph)'].isnull().sum().compute()  # output 454609

Replacing the null values with 10 mph.

    dask_df['Wind_Speed(mph)'] = dask_df['Wind_Speed(mph)'].fillna(10)

Checking the count of null values in the wind speed column to confirm that the null values have been filled.

    dask_df['Wind_Speed(mph)'].isnull().sum().compute()  # output 0
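
As an aside, a constant is not the only choice: the column mean could be used as the fill value instead. A minimal sketch of that alternative (it would replace the constant fill above, while the nulls are still present):

    # Compute the mean of the non-null values first, then use it as the fill value
    wind_mean = dask_df['Wind_Speed(mph)'].mean().compute()
    dask_df['Wind_Speed(mph)'] = dask_df['Wind_Speed(mph)'].fillna(wind_mean)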

Dropping rows with null values

Let's drop the rows whose Zipcode value is null.

    dask_df = dask_df.dropna(subset=['Zipcode'])
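
A quick sanity check (a small sketch, not from the original post) that the drop worked:

    # All remaining rows should now have a Zipcode
    dask_df['Zipcode'].isnull().sum().compute()   # expected: 0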

    Dropping columns

Dropping the long_delay column

    print(any(dask_df.columns=='long_delay'))       # output True
dask_df = dask_df.drop('long_delay', axis=1)    # Dropping the long_delay column
    print(any(dask_df.columns=='long_delay')) # output False

    Groupby method

Suppose we want to know the average temperature of each state. Here the groupby method is used. Grouping the dataframe by State:

    byState = dask_df.groupby('State')

    Getting the average temperature of each state.

    byState['Temperature(F)'].mean().compute()
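
The grouped object is not limited to a single statistic; a small sketch of computing several aggregations in one go with agg:

    # Mean, minimum and maximum temperature per state
    byState['Temperature(F)'].agg(['mean', 'min', 'max']).compute()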

These are some useful methods of Dask DataFrame which are commonly used while dealing with dataframes. I hope this helps you while learning data science. Thank you!

A verification kernel for this blog: check here.
