Get Faster Pandas With Modin

Sai Kumar Sept 25 2020 · 3 min read

Pandas is a library that needs no introduction in the field of data science. It provides high-performance, easy-to-use data structures and data analysis tools. However, when working with very large amounts of data, Pandas on a single core becomes insufficient, and people have to resort to distributed systems to improve performance. That improved performance comes with a steep learning curve. Most users just want Pandas to run faster and aren’t looking to optimize their workflows for their particular hardware setup; they want to use the same Pandas script for a 10KB dataset as for a 10TB dataset. Modin aims to solve this by optimizing pandas so that data scientists spend their time extracting value from their data rather than on the tools that extract it.

For a more detailed walkthrough, see the Python notebook linked below.

Link: https://github.com/saikumar110/Get-Faster-Pandas-With-Modin

Installation

There are a couple of ways to install Modin. Most users will want to install with pip, but some users may want to build from the master branch on the GitHub repo. The master branch has the most recent patches but may be less stable than a release installed from pip.

Installing with pip

Modin can be installed with pip. To install the most recent stable release run the following:

pip install -U modin # -U for upgrade in case you have an older version

If you don’t have Ray or Dask installed, you will need to install Modin with one of the targets:

pip install modin[ray] # Install Modin dependencies and Ray to run on Ray
pip install modin[dask] # Install Modin dependencies and Dask to run on Dask
pip install modin[all] # Install all of the above

If you would like to install a pre-release of Modin, run the following:

pip install --pre modin

These pre-releases are uploaded so that dependent projects and users can test their existing code and make sure it still works. If you find something wrong, please raise an issue or email the bug reporter: [email protected]

Installing from the GitHub master branch

pip install git+https://github.com/modin-project/modin

Windows

For installation on Windows, the Dask engine is recommended. At the time of writing, Ray does not support Windows, so it is not possible to install modin[ray] or modin[all] there. It is possible to use Windows Subsystem for Linux (WSL), but this is generally not recommended because of the limitations and poor performance of Ray on WSL, roughly a 2-3x slowdown.

To install with the Dask engine, run the following using pip:

pip install modin[dask]
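The engine choice described above can also be made in code. The following is a minimal sketch, not Modin's own logic: the helper name pick_modin_engine is hypothetical, and it assumes Modin reads the MODIN_ENGINE environment variable when modin.pandas is first imported. It prefers Dask on Windows and Ray elsewhere, and falls back to whichever engine is installed.

```python
import importlib.util
import os
import platform


def pick_modin_engine():
    """Return "ray" or "dask" based on platform and installed packages, else None.

    Hypothetical helper for illustration; not part of Modin.
    """
    installed = [name for name in ("ray", "dask") if importlib.util.find_spec(name)]
    # Follow the advice above: prefer Dask on Windows, Ray elsewhere.
    preferred = "dask" if platform.system() == "Windows" else "ray"
    if preferred in installed:
        return preferred
    return installed[0] if installed else None


engine = pick_modin_engine()
if engine is not None:
    # Assumption: MODIN_ENGINE must be set before modin.pandas is imported.
    os.environ["MODIN_ENGINE"] = engine
```

If neither engine is installed, the helper returns None and leaves the environment untouched, which is a signal to install one of the targets listed earlier.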

Building Modin from Source

If you’re planning on contributing to Modin, you will need to ensure that you are building Modin from the local repository that you are working in. Occasionally, overlapping Modin installs from PyPI and from source cause problems. To avoid these issues, we recommend uninstalling Modin before you install from source:

pip uninstall modin

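Next, clone the repository from GitHub (https://github.com/modin-project/modin) if you have not already. Once cloned, cd into the modin directory and use pip to install: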

cd modin
pip install -e .

Using Modin

Modin is an early-stage DataFrame library that wraps pandas and transparently distributes the data and computation, accelerating your pandas workflows with a one-line code change. The user does not need to know how many cores their system has, nor do they need to specify how to distribute the data. In fact, users can keep using their existing pandas notebooks while getting a considerable speedup from Modin, even on a single machine. Only the import statement needs to change, as shown below. Once you’ve changed your import statement, you’re ready to use Modin just like you would pandas, since the API is identical to pandas.

# import pandas as pd
import modin.pandas as pd
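If the same script has to run on machines with and without Modin, the import swap can be made conditional. This is a minimal sketch under that assumption; backend_name and load_backend are hypothetical helper names, not part of pandas or Modin.

```python
import importlib
import importlib.util


def backend_name():
    """Name of the DataFrame module to import: Modin if installed, else pandas."""
    if importlib.util.find_spec("modin") is not None:
        return "modin.pandas"
    return "pandas"


def load_backend():
    """Import and return the chosen module; bind the result to the usual `pd` alias."""
    return importlib.import_module(backend_name())


# Usage (requires pandas, and optionally Modin, to be installed):
#   pd = load_backend()
#   df = pd.DataFrame({"a": [1, 2, 3]})
```

The rest of the script keeps using pd as before, so nothing else has to change whichever library is picked.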

Comparisons

Modin manages the data partitioning and shuffling so that users can focus on extracting value from the data. The following code was run on a 2013 4-core iMac with 32GB of RAM.

pd.read_csv

read_csv is by far the most used pandas operation. Let’s do a quick comparison of read_csv in pandas vs. Modin.

Pandas

%%time
import pandas
pandas_csv_data = pandas.read_csv("../800MB.csv")
-----------------------------------------------------------------
CPU times: user 26.3 s, sys: 3.14 s, total: 29.4 s
Wall time: 29.5 s

Modin

%%time
modin_csv_data = pd.read_csv("../800MB.csv")
-----------------------------------------------------------------
CPU times: user 76.7 ms, sys: 5.08 ms, total: 81.8 ms
Wall time: 7.6 s

With Modin, read_csv runs almost 4x faster here (29.5 s vs. 7.6 s wall time) on a 4-core machine, just by changing the import statement.
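To make the comparison concrete, the speedup can be computed from the two wall times reported above. The sketch below also includes a small timing helper as a stand-in for %%time outside a notebook; time_call and speedup are hypothetical names used only for illustration.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Wall times reported above for reading the 800MB CSV.
pandas_wall, modin_wall = 29.5, 7.6
ratio = speedup(pandas_wall, modin_wall)
print(f"speedup: {ratio:.2f}x, per-core efficiency on 4 cores: {ratio / 4:.0%}")
```

To time your own load, something like `data, seconds = time_call(pd.read_csv, "../800MB.csv")` would work with either pandas or Modin bound to pd.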

Conclusions

Modin is still in its early stages and appears to be a very promising add-on to pandas. Modin handles all the partitioning and shuffling for the user so that we can focus on our workflows. Modin’s basic goal is to let users apply the same tools to small data and big data alike, without having to change the API to suit different data sizes.

Visit the Documentation for more information!

modin.pandas is currently under active development. Requests and contributions are welcome!

More information and Getting Involved

  • Documentation
  • Ask questions or participate in discussions on our Discourse
  • Join our mailing list [email protected]
  • Submit bug reports to our GitHub Issues Page
  • Contributions are welcome! Open a pull request