How to work with DASK?

#pandas #dask #python #datascience #ml

Hemanth Vangara Sept 21 2020 · 3 min read

Hi, 

In this blog we are going to learn about DASK.

DASK is a flexible library for parallel computing in Python.

For slow tasks operating on large amounts of data, you should definitely try DASK out. It requires only minimal changes to your existing Pandas code, and because it splits the data into partitions and processes them in parallel, the result is often faster code with lower memory use.

1.Installation Of DASK using PIP:

    pip install "dask[complete]"

2.Importing DASK:

    import dask.dataframe

Importing PANDAS vs DASK

Here we can see that the code is the same except for the import statement.

3.Reading Data:

To read CSV data using DASK, we use read_csv().

Reading csv file

Notice that calling 'df' does not show the data in the output. DASK works lazily, so 'df' only shows the structure of the DataFrame. In order to read the complete data we should use .compute().

Displays the complete data

But generally we don't need to display the whole data, so we will be working with specific conditions.

Now let us see some basic operations:

To display all the columns present in the data.
To see what kind of data each column holds (its data type).

4.Element-wise operations:

To select the data of a column and work on it element by element, i.e., element-wise operations.

5.Row-wise selections:

To select the rows based on given conditions.

Displays the rows which satisfy the given conditions.

Note:

We can't use inplace=True in DASK. Instead, assign the result back to a variable.

6.Set Index:

To set a specific column as the index we need to use set_index().

Sets the given column as the index.

7.LOC & ILOC:

loc is used to select a range of rows based on the index labels.

We can use floating point numbers in the range of indexes.

Displays the rows whose index is in the range of 5.0 to 10.5.

iloc is used to select data based on the integer positions of columns.

Selects the columns at positions 0 up to 4 in steps of 2.

NOTE:

  • We can use floating point numbers in loc, whereas we cannot in the case of iloc.
  • We should not pass row indexes or column names to iloc. In DASK it only selects columns by integer position.

8.Common aggregations:

We can apply some built-in aggregation methods to column data.

Returns the maximum and sum of Prices in the given data.

9.Is in:

isin() is used to select the rows whose values are present in a given list or range.

10.Groupby:

To group by specific columns in the given data we will use groupby().

Groups by company name and displays the total price for each company.

Now we will see how to read EXCEL data by using DASK. DASK does not have an inbuilt function to read EXCEL files, so we read the file with PANDAS and then hand the result over to DASK.

Reads EXCEL data using PANDAS and DASK.
To group the data based on the Country column.

To find the country which has the largest quantity.

We use .nlargest() to find the largest items, and similarly we use .nsmallest() to find the smallest items.

Country with the largest quantity.
Country with the smallest quantity.
To find the 5 largest.

Applying a specific function to groupby().

Increasing the Price column data by 10.

11.value_counts:

Counts the number of times each value is repeated.

12.Drop duplicates:

To drop the repeated values we use drop_duplicates().

Drops the repeated values.

13.DateTime:

Converting a date and time column of object type to the datetime data type.

Converts to the datetime data type.

These are some of the basic operations that can be performed by using DASK.
