This is the first post in a series of three that aims to make you comfortable with Dask and even teach you some of its more advanced features. In this post we will look at some theoretical insights into why Dask is so useful and cover its most basic use cases. In the following posts we will dig deeper into more complex endeavours.
Let's get to it!
What is Dask?
Dask is a flexible library for parallel computing in Python.
Libraries such as Pandas provide you with data-frames in which you can store data. Operations such as adding, deleting, appending and loading data can be performed on these data-frames.
Dask also provides you with data-frames, but it is built around the concept of parallel computation. Let's assume your system has 4 processors (or cores). A regular data-frame, such as the ones in Pandas or NumPy, uses only one core to process the data.
This is good enough as long as the data set is small, especially when it has to be loaded. But if the data set is enormous, say gigabytes in size, one core alone may not be sufficient to handle it, and processing slows down. What the Dask framework does is divide the work equally across the 4 cores (or processors), which makes processing comparatively very fast.
The Data: Let's Start with the Famous Titanic
As the goal of this post is to explore the different capabilities of Dask, we will not be using a complex data set but rather a very simple one, which does however contain different kinds of features and a few complications that we will have to solve through the appropriate use of Dask. We will use different data sets in the next posts.
How To Install Dask?
You can install Dask with

!pip install "dask[complete]"

or by installing from source.
How to use Dask Dataframes
Dask Dataframes have the same API as Pandas Dataframes, except that aggregations and applys are evaluated lazily and need to be materialised by calling the compute method. To generate a Dask Dataframe you can simply call the read_csv method just as you would in Pandas or, given a Pandas Dataframe df, call dd.from_pandas(df, npartitions=N).

import dask.dataframe as dd
df = dd.read_csv('titanic_test.csv')
df.head() - Returns the first n rows of the object based on position. It is useful for quickly testing whether your object has the right type of data in it.
df.info() - Prints information about the DataFrame, including the index dtype, the columns, non-null counts and memory usage.
df.tail() - Returns the last n rows of the object based on position. It is useful for quickly verifying data, for example after sorting or appending rows.
df.describe() - Shows basic statistical details such as percentiles, mean, std, etc. of a data frame or a series of numeric values.
df.columns - Lists all the columns present in the dataset.
df.index.compute() - Materialises the index so you can see the labels present in the dataset.
Selection with [ ], .loc and .iloc
To select a single column of data, simply put the name of the column in between the brackets. Let's select the PassengerId column. We can check its type: it is a Series.
Selecting multiple columns with just the indexing operator
It's possible to select multiple columns with just the indexing operator by passing it a list of column names. We can check its type: it is a DataFrame.
Getting started with .loc
The .loc indexer selects data in a different way than the plain indexing operator. It can select subsets of rows or columns, and it can also select subsets of rows and columns simultaneously. Most importantly, it selects data only by the LABEL of the rows and columns.
Select a single row as a DataFrame with .loc
When given a single row label, the Dask .loc indexer returns the matching row as a one-row DataFrame (unlike Pandas, which returns a Series). Let's select the row with label 1.
Select multiple rows as a DataFrame with .loc
Selecting rows and columns simultaneously with .loc
Unlike just the indexing operator, it is possible to select rows and columns simultaneously with .loc. You do it by separating your row and column selections with a comma.
Getting started with .iloc
Dask DataFrame does not track the length of partitions, making positional indexing with .iloc inefficient for selecting rows. DataFrame.iloc only supports indexers where the row indexer is : (a shorthand for selecting all rows); in other words, only the column positions can be selected.
This is only part 1 of the series; some of the explanations in this part will be expanded to cover other possibilities in the posts to come.