In this post we will get to know Dask.
Dask is a flexible library for parallel computing in Python.
If you have slow tasks that operate on large amounts of data, Dask is well worth trying. It often needs only minimal changes to your existing Pandas code, and in return the code can run faster and use less memory.
1. Installation of Dask using pip:
pip install "dask[complete]"
Compared with the Pandas version, the code stays the same except for the import statement.
To read CSV data using Dask, we use read_csv().
If we just call 'df', the output does not show the data, because Dask evaluates lazily and displays only the structure of the DataFrame. To read the complete data we need to call .compute().
In general, though, we do not need to display the whole dataset, so we will work with specific conditions instead.
Now let us look at some basic operations.
Selecting a column gives us its data, on which we can run element-wise operations.
Rows can be selected based on given conditions.
We can't use inplace=True in Dask, so the result has to be assigned to a variable instead.
To set a specific column as the index, we need to use set_index().
7. LOC & ILOC:
loc selects a range of rows based on the index labels.
The bounds of that range can also be floating-point numbers.
iloc selects data based on the integer positions of the columns. In Dask it supports column selection only, so the row part has to be ':'.
We can apply some built-in methods to column data.
isin() selects the rows whose values appear in a given list or range.
To group the data by specific columns, we use groupby().
Next we will see how to read Excel data with Dask, since Dask has no built-in function for reading Excel files.
Next, we find the Country that has the largest Quantity.
We use .nlargest() to find the largest items, and similarly .nsmallest() to display the smallest items.
We can also apply a specific function to the result of groupby().
We can count the number of times each value is repeated.
We can also drop the repeated values.
A column that stores dates and times as objects can be converted to the datetime data type.
These are some of the basic operations that can be performed using Dask.