Introduction to Dask
Dask is an open source library for parallel computing written in Python. Originally developed by Matthew Rocklin, Dask is a community project maintained and sponsored by many developers and organizations. Check the Dask documentation for more info.
In this blog we will learn to manipulate the US Accidents dataset (A Countrywide Traffic Accident Dataset, 2016-20) with a Dask DataFrame. This is a huge dataset with over 3.5 million rows and 49 columns, about 1.24 gigabytes in size.
A Dask DataFrame is a large parallel DataFrame composed of many smaller Pandas DataFrames, split along the index. These Pandas DataFrames may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster. One Dask DataFrame operation triggers many operations on the constituent Pandas DataFrames. For more info, check the Dask DataFrame documentation.
import dask.dataframe as dd
import dask.array as da
import pandas as pd
import numpy as np

dask_df = dd.read_csv('/kaggle/input/us-accidents/US_Accidents_June20.csv')
As said above, a Dask DataFrame contains many smaller Pandas DataFrames; let's verify this.
Checking the length of each partition
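A minimal sketch of this check: npartitions gives the partition count, and map_partitions applies a function (here len) to every underlying Pandas partition.

print(dask_df.npartitions)             # number of partitions
dask_df.map_partitions(len).compute()  # row count of each partition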
There are 21 partitions by default, each of a different length.
Accessing one of the partitions
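One way to do this is get_partition, which returns a single partition as a (still lazy) Dask DataFrame; calling compute() on it yields a plain Pandas DataFrame.

part = dask_df.get_partition(0)  # first partition, still a lazy Dask DataFrame
part.compute()                   # materialized as a Pandas DataFrame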
Let's explore the dataset with the Dask DataFrame.
Getting the top 5 rows of the DataFrame
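head() works just as in Pandas; Dask only needs to read from the first partition, so no explicit compute() is required.

dask_df.head()  # first 5 rows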
Getting the last 5 rows
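Likewise, tail() reads from the last partition.

dask_df.tail()  # last 5 rows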
Getting the five-number summary with mean and count
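describe() is lazy, so compute() is needed to materialize the summary.

dask_df.describe().compute()  # count, mean, std, min, quartiles and max of numeric columns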
Setting the ID column as the index
dask_df = dask_df.set_index('ID')
dask_df
Dask does lazy computation. Here we only get the skeleton of the Dask DataFrame, in which the index is ID; the actual values are filled in when a computation is triggered.
Appending a new column
Severity shows the impact on traffic duration, so let's find the cases in which the traffic delay is longest. We append a new column named 'long_delay' which is True when an accident has the longest traffic duration. In Severity, the longest traffic duration is denoted by the number 4.
dask_df['long_delay'] = dask_df['Severity']==4
dask_df.head() # Checking for new column
At the right of the table, the long_delay column has been appended.
Changing the data types of columns
dask_df['long_delay'].dtype  # output: bool
dask_df['long_delay'] = dask_df['long_delay'].astype('int')
dask_df['long_delay'].dtype  # output: int64
The data type of the long_delay column changes from bool to int64.
Getting unique elements
dask_df['Severity'].unique().compute()
'''
Output:
0    3
1    2
2    4
3    1
'''
The above output shows that there are four unique elements in Severity: 3, 2, 4 and 1.
Counting each unique element
dask_df['Severity'].value_counts().compute()
'''
Output:
2    2373210
3     998913
4     112320
1      29174
'''
Accessing the data of the DataFrame
1. Accessing a particular row
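With ID set as the index, a single row can be looked up with loc; 'A-100' is the same example ID used below.

dask_df.loc['A-100'].compute()  # the row whose ID is 'A-100'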
2. Accessing a particular column
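A column is selected by name, just as in Pandas.

dask_df['State'].head()  # first values of the State column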
3. Accessing a particular row and column
dask_df.loc['A-100','State'].compute()
'''
Output:
ID
A-100    OH
Name: State, dtype: object
'''
4. Accessing a range of rows
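A sketch of a range lookup using loc slicing; the end label 'A-105' here is just an assumed example ID.

dask_df.loc['A-100':'A-105'].compute()  # rows with IDs from 'A-100' to 'A-105'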
Searching for rows where the latitude equals 39.865147
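A sketch of this filter, assuming the latitude column is named 'Start_Lat' as in the Kaggle dataset.

dask_df[dask_df['Start_Lat'] == 39.865147].compute()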
Multiple Conditional Search
Searching on both longitude and latitude
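A sketch combining the two conditions with &; the longitude value -84.058723 is an assumed example, and the column names 'Start_Lat' and 'Start_Lng' follow the Kaggle dataset.

dask_df[(dask_df['Start_Lat'] == 39.865147) &
        (dask_df['Start_Lng'] == -84.058723)].compute()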
Getting the number of null values in each column
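isnull().sum() counts the missing values per column, and compute() triggers the actual scan.

dask_df.isnull().sum().compute()  # null count of every column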
Filling Null values
Calculating the number of null values in the Wind_Speed(mph) column, which is 454609.
dask_df['Wind_Speed(mph)'].isnull().sum().compute() # output 454609
Replacing the null values with 10 mph.
dask_df['Wind_Speed(mph)'] = dask_df['Wind_Speed(mph)'].fillna(10)
Checking the count of null values in the wind speed column to verify that the null values have been filled.
dask_df['Wind_Speed(mph)'].isnull().sum().compute() # output 0
Dropping rows with null values
Let's drop the rows whose Zipcode value is null.
dask_df = dask_df.dropna(subset=['Zipcode'])
Dropping the long_delay column
print(any(dask_df.columns == 'long_delay'))  # output: True
dask_df = dask_df.drop('long_delay', axis=1)  # Dropping the long_delay column
print(any(dask_df.columns == 'long_delay'))  # output: False
Suppose we want to know the average temperature of each state. The groupby method is used for this. Grouping the DataFrame by state:
byState = dask_df.groupby('State')
Getting the average temperature of each state:
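A sketch, assuming the temperature column is named 'Temperature(F)' as in the Kaggle dataset.

byState['Temperature(F)'].mean().compute()  # mean temperature per state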
These are some useful Dask DataFrame methods that are commonly used when working with DataFrames. I hope this helps you in learning data science. Thank you!
A verification kernel for this blog is available here.