Pandas is a library that needs no introduction in the field of data science. It provides high-performance, easy-to-use data structures and data analysis tools. However, when working with very large amounts of data, pandas on a single core becomes insufficient, and people have to resort to distributed systems to get better performance. That improved performance, however, comes with a steep learning curve. Most users just want pandas to run faster and aren’t looking to optimize their workflows for their particular hardware setup. They want to use the same pandas script for a 10KB dataset as for a 10TB dataset. Modin aims to provide a solution by optimizing pandas so that data scientists spend their time extracting value from their data rather than on the tools that extract it.
For a more detailed description, have a look at the Python notebook mentioned below.
There are a couple of ways to install Modin. Most users will want to install with pip, but some users may want to build from the master branch on the GitHub repo. The master branch has the most recent patches but may be less stable than a release installed from pip.
Installing with pip
Modin can be installed with pip. To install the most recent stable release run the following:
pip install -U modin # -U for upgrade in case you have an older version
If you don’t have Ray or Dask installed, you will need to install Modin with one of the targets:
pip install modin[ray] # Install Modin dependencies and Ray to run on Ray
pip install modin[dask] # Install Modin dependencies and Dask to run on Dask
pip install modin[all] # Install all of the above
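Before picking a target, it can help to see which engines are already present in the environment. The following is a minimal sketch, assuming the engines are importable under the names ray and dask; the helper installed_engines is a hypothetical name used only for illustration.

```python
from importlib.util import find_spec


def installed_engines(candidates=("ray", "dask")):
    """Return the candidate packages that can be located without importing them."""
    return [name for name in candidates if find_spec(name) is not None]


# An empty list suggests installing one of the targets, e.g. modin[ray] or modin[dask].
print(installed_engines())
```

If the list comes back empty, one of the pip targets above is needed before Modin can run.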
If you would like to install a pre-release of Modin, run the following:
pip install --pre modin
These pre-releases are uploaded for dependencies and users to test their existing code to ensure that it still works. If you find something wrong, please raise an issue or email the bug reporter: [email protected]
Installing from the GitHub master branch
pip install git+https://github.com/modin-project/modin
For installation on Windows, it is recommended to use the Dask engine. At the time of writing, Ray does not support Windows, so it is not possible to install modin[ray] or modin[all] there. It is possible to use Windows Subsystem for Linux (WSL), but this is generally not recommended due to the limitations and poor performance of Ray on WSL, roughly a 2-3x cost.
To install with the Dask engine, run the following using pip:
pip install modin[dask]
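The platform advice above can also be applied in code. This is a minimal sketch, assuming the MODIN_ENGINE environment variable described in Modin's documentation is used to select the engine; the helper choose_engine is hypothetical and simply encodes the recommendation in this article (Dask on Windows, Ray elsewhere).

```python
import os
import platform


def choose_engine(system=None):
    """Pick an engine name following the article's advice: Dask on Windows, Ray otherwise."""
    system = system or platform.system()
    return "dask" if system == "Windows" else "ray"


# Assumption: the variable must be set before modin.pandas is imported.
# setdefault keeps any value the user has already exported.
os.environ.setdefault("MODIN_ENGINE", choose_engine())
```

Using setdefault rather than a plain assignment means an engine chosen explicitly in the shell still takes precedence.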
Building Modin from Source
If you’re planning on contributing to Modin, you will need to ensure that you are building Modin from the local repository that you are working in. Occasionally, installs of Modin from PyPI and from source conflict with each other. To avoid these issues, we recommend uninstalling Modin before you install from source:
pip uninstall modin
Once you have cloned the repository (https://github.com/modin-project/modin), cd into the modin directory and use pip to install:
cd modin
pip install -e .
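Because overlapping installs are the problem this section warns about, it can be worth confirming which copy of Modin Python will actually import. A minimal sketch using only the standard library; package_origin is a hypothetical helper name.

```python
from importlib.util import find_spec


def package_origin(name="modin"):
    """Return the file path a package would be loaded from, or None if it is not found."""
    spec = find_spec(name)
    return None if spec is None else spec.origin


# After an editable install, this path is expected to point into the local clone
# rather than into site-packages.
print(package_origin())
```

If the printed path still points into site-packages, an older PyPI install is probably shadowing the source build.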
Modin is an early-stage DataFrame library that wraps pandas and transparently distributes the data and computation, accelerating your pandas workflows with a one-line code change. The user does not need to know how many cores their system has, nor do they need to specify how to distribute the data. In fact, users can continue using their previous pandas notebooks while experiencing a considerable speedup from Modin, even on a single machine. Only the import statement needs to change, as shown below. Once you’ve changed your import statement, you’re ready to use Modin just like you would pandas, since the API is identical to that of pandas.
# import pandas as pd
import modin.pandas as pd
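For scripts that must also run where Modin is not installed, the import swap can be made tolerant. This is a minimal sketch, not something Modin itself provides; load_dataframe_module is a hypothetical helper that tries modin.pandas first and falls back to plain pandas.

```python
import importlib


def load_dataframe_module(candidates=("modin.pandas", "pandas")):
    """Import and return the first module in candidates that is available."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # try the next candidate
    raise ImportError(f"none of {candidates} could be imported")
```

With this in place, pd = load_dataframe_module() would replace the import line, and the rest of the script keeps using pd as before.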
Modin manages the data partitioning and shuffling so that users can focus on extracting value from the data. The following code was run on a 2013 4-core iMac with 32GB of RAM.
read_csv is by far the most used pandas operation. Let’s do a quick comparison of read_csv in pandas and in Modin:
pandas_csv_data = pandas.read_csv("../800MB.csv")
-----------------------------------------------------------------
CPU times: user 26.3 s, sys: 3.14 s, total: 29.4 s
Wall time: 29.5 s
modin_csv_data = pd.read_csv("../800MB.csv")
-----------------------------------------------------------------
CPU times: user 76.7 ms, sys: 5.08 ms, total: 81.8 ms
Wall time: 7.6 s
In this run, read_csv finished in 7.6 s with Modin versus 29.5 s with pandas, about 3.9x faster on a 4-core machine, just by changing the import statement.
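The comparison above can be reproduced with a small timing helper. This is a minimal sketch using only the standard library; wall_time and speedup are hypothetical helper names, and the two numbers passed to speedup are the wall times reported above.

```python
import time


def wall_time(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Wall times from the runs above: 29.5 s (pandas) and 7.6 s (Modin).
print(f"{speedup(29.5, 7.6):.1f}x")  # prints 3.9x
```

To time your own file, pass each library's read_csv as fn, for example wall_time(pd.read_csv, "../800MB.csv"). Wall time is the figure to compare: the CPU times reported for the Modin run are far lower than its wall time, likely because most of the work happens outside the main process.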
Modin is still in its early stages and appears to be a very promising add-on to pandas. Modin handles all the partitioning and shuffling for the user so that we can focus on our workflows. Modin’s basic goal is to let users apply the same tools to small data and big data alike, without having to change the API to suit different data sizes.
Visit the Documentation for more information!
modin.pandas is currently under active development. Requests and contributions are welcome!