When you begin your data science journey with Python, you quickly learn that pandas and NumPy are the core libraries for all sorts of data analysis and data manipulation. Almost any such operation you can think of is already implemented in them. The main reason these libraries have become so popular is that they combine high-level, usable, and flexible APIs with high-performance implementations and easy-to-use data structures. Pandas and NumPy are incredible libraries, but they are not always computationally efficient: both have constraints on scalability and speed when working with large datasets. Dask was designed as a solution to these issues.
What is Dask?
Dask is a flexible library for parallel computing in Python. It can handle big datasets on a single machine by utilizing multiple cores, or scale out across distributed machines. It works in good synergy with other Python packages, giving them the versatility of parallel computing and scalability. The main data structures (collections) available in dask include:
- Dask Array (a parallel, chunked NumPy array)
- Dask DataFrame (a parallel, partitioned pandas DataFrame)
- Dask Bag (a parallel collection for unstructured data)
In this article, we'll take a close look at how the dask array works alongside the numpy array for numeric processing.
Before exploring the dask array, let's first look at how to set up dask on your system. You can install it with conda (conda install dask) or with pip (pip install dask).
Dask array provides a parallel, larger-than-memory, n-dimensional array using blocked algorithms. In simple words, it's a distributed numpy array.
Converting a Numpy array into Dask array
Importing Numpy and Dask
import numpy as np
import dask.array as da
We create a numpy array containing 10000 randomly generated elements. We can then use the from_array() function of dask to convert this numpy array into a dask array. Additionally, we have to pass the chunks argument: the number of elements in each chunk.
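The original snippet isn't preserved here, but the conversion might look like the following sketch (the array name a and the chunk size of 2500 are assumptions for illustration):

```python
import numpy as np
import dask.array as da

# a numpy array of 10000 randomly generated elements
a = np.random.rand(10000)

# convert it into a dask array, split into chunks of 2500 elements each
a_dask = da.from_array(a, chunks=2500)

print(a_dask.chunks)  # ((2500, 2500, 2500, 2500),)
```

The chunks argument controls how the array is partitioned; each chunk becomes an independent unit of work when dask later executes a computation.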
Using Methods/Attributes with Numpy vs Dask
# a = (10000,) size randomly assigned array
print('mean calculation done over numpy array:', a.mean())
a.sum() computes its result immediately. a_dask.mean(), in contrast, only builds an expression of the computation; it doesn't execute it yet.
print('mean calculation done over dask array:',a_dask.mean().compute())
Dask array objects are lazily evaluated. Operations like .mean() build up a graph of blocked tasks to execute. We ask for the final result with a call to .compute(). This triggers the actual computation.
Common Dask Array Methods/Attributes similar to Numpy
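The original list isn't shown here, but as a sketch, dask arrays expose many of the same attributes and methods as numpy arrays (the array below is an assumed example):

```python
import numpy as np
import dask.array as da

a_dask = da.from_array(np.arange(10000), chunks=2500)

# familiar numpy-style attributes, available without any computation
print(a_dask.shape)  # (10000,)
print(a_dask.ndim)   # 1
print(a_dask.size)   # 10000

# familiar numpy-style reductions; these build graphs, and
# compute() triggers the actual work
print(a_dask.sum().compute())   # 49995000
print(a_dask.mean().compute())  # 4999.5
```

Attributes like shape and ndim are known from the metadata alone, which is why they return instantly even on very large arrays.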
Working with chunks/parts in Numpy vs Dask
n_parts = 4
chunk_size = int(len(a) / n_parts)
total = 0
for i in range(n_parts):
    starting_index = int(i * chunk_size)
    print('index of part ' + str(i + 1) + ' starts from ' + str(starting_index))
    a_parts = a[starting_index:starting_index + chunk_size]
    total += a_parts.sum()
print(total)
We start by defining the number of parts to break our array into, then we iterate in a for loop: calculating the start index, slicing out a part of the array, and adding each part's sum to a running total. Each loop iteration is independent, so in principle it could be executed in parallel.
With dask, we don't need to compute indices and slice chunks explicitly; the dask array does that for us. Calling just the sum() function builds an expression of the computation, and the actual computation is delayed until compute() is called.
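The dask version of the chunked sum above might look like this sketch (the array name and chunk size are assumed to match the manual loop):

```python
import numpy as np
import dask.array as da

a = np.random.rand(10000)
a_dask = da.from_array(a, chunks=2500)  # same 4 parts as the manual loop

# builds a graph with one partial-sum task per chunk; nothing runs yet
total = a_dask.sum()

# executes the graph, summing the chunks in parallel
print(total.compute())
```

Compare this with the explicit loop: the slicing, the per-part sums, and the final combination are all derived automatically from the chunks definition.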
Dask array translates your array operations into a graph of inter-related tasks with data dependencies between them. Dask then executes this graph in parallel with multiple threads.
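You can also choose the scheduler explicitly when triggering the computation; as a sketch, the threaded scheduler shown here is dask's default for arrays:

```python
import numpy as np
import dask.array as da

a_dask = da.from_array(np.random.rand(10000), chunks=2500)

# each chunk contributes an independent task to the graph;
# the threaded scheduler runs those tasks on multiple threads
result = a_dask.mean().compute(scheduler='threads')
print(result)
```

For arrays of numeric data, threads work well because numpy releases the GIL during most of its heavy computations.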
It is easy to get started with Dask arrays, but using them well does require some experience. Going forward, there is a lot more to explore in dask arrays. Remember, it all comes down to practice.