It is easy to install Modin, its dependencies, and its execution engines using pip.
The easiest way is to install Modin together with all of its supported engines.
pip install modin[all]
You can also install just the Modin dependencies and Ray to run Modin on Ray.
pip install modin[ray]
Or just install Modin dependencies and Dask to run on Dask.
pip install modin[dask]
Or just install Modin dependencies and unidist to run on unidist.
pip install modin[unidist]
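After installing, it can be useful to confirm which of these packages are actually importable in the current environment. The following is a minimal sketch that uses only the standard library; the helper name installed_engines is hypothetical and is not part of Modin.

```python
from importlib.util import find_spec

# Import names of Modin and the engines discussed above.
PACKAGES = ("modin", "ray", "dask", "unidist")

def installed_engines(packages=PACKAGES):
    """Return a dict mapping each package name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in packages}

# Report the status of each package without importing it.
for name, available in installed_engines().items():
    print(f"{name}: {'installed' if available else 'missing'}")
```

Because find_spec only looks the package up, this check does not start Ray or Dask.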
To continue using your existing pandas code, just replace the line import pandas as pd with import modin.pandas as pd.
# import pandas as pd
import modin.pandas as pd
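If the same script has to run on machines where Modin may not be installed, one possible pattern is to fall back to plain pandas. The sketch below is an assumption about how you might organize this, not something Modin requires; import_first_available is a hypothetical helper name.

```python
import importlib

def import_first_available(module_names):
    """Import and return the first module in module_names that is installed."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # This candidate is not installed; try the next one.
            continue
    raise ImportError(f"None of these modules could be imported: {module_names}")

# Typical use: prefer Modin, fall back to plain pandas.
# pd = import_first_available(["modin.pandas", "pandas"])
```

The rest of the code keeps using the pd alias, so nothing else has to change.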
Modin also lets you choose which engine to use for computation by setting the MODIN_ENGINE environment variable within a notebook or interpreter before you import Modin. Don't run the code below now; we will run it when we need it, in examples 5 and 6 below.
import os
# Choose only one engine; each assignment overrides the previous one.
os.environ["MODIN_ENGINE"] = "ray" # Modin will use Ray
os.environ["MODIN_ENGINE"] = "dask" # Modin will use Dask
os.environ["MODIN_ENGINE"] = "unidist" # Modin will use Unidist
os.environ["UNIDIST_BACKEND"] = "mpi" # Unidist will use MPI backend
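Since only one engine can be active, the assignments above can be wrapped in a small helper that validates the name and sets the variable once. This is a rough sketch under stated assumptions: select_engine and KNOWN_ENGINES are hypothetical names, and the set of engines is limited to the three mentioned in this article rather than Modin's full list.

```python
import os
import sys

# Engine names taken from this article; an assumption, not Modin's complete list.
KNOWN_ENGINES = {"ray", "dask", "unidist"}

def select_engine(name, unidist_backend=None):
    """Set MODIN_ENGINE (and optionally UNIDIST_BACKEND) before Modin is imported."""
    engine = name.lower()
    if engine not in KNOWN_ENGINES:
        raise ValueError(f"Unknown engine {name!r}; expected one of {sorted(KNOWN_ENGINES)}")
    if "modin" in sys.modules:
        # The variable is meant to be set before the import, so flag late calls.
        print("Warning: modin is already imported; the setting may not take effect.")
    os.environ["MODIN_ENGINE"] = engine
    if engine == "unidist" and unidist_backend is not None:
        os.environ["UNIDIST_BACKEND"] = unidist_backend
    return engine
```

For example, calling select_engine("ray") at the top of a notebook, before import modin.pandas, has the same effect as the first assignment above.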
The second method is to use the modin_cfg and unidist_cfg configuration modules.
import modin.config as modin_cfg
import unidist.config as unidist_cfg
modin_cfg.Engine.put("ray") # Modin will use Ray
modin_cfg.Engine.put("dask") # Modin will use Dask
modin_cfg.Engine.put('unidist') # Modin will use Unidist
unidist_cfg.Backend.put('mpi') # Unidist will use MPI backend
To compare the speed of reading data using Pandas, Modin, Modin (Ray), and Modin (Dask), we use the time module from the Python standard library to measure the data loading time. In a Jupyter notebook, you can also use %time or %%time instead to show the running time of a code snippet.
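The start/end pattern used in the examples can also be packaged as a context manager, which avoids repeating the bookkeeping for every measurement. This is only a sketch: timed is a hypothetical helper, and it uses time.perf_counter rather than the time.time calls used in this article, because perf_counter is intended for measuring short durations.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Measure the wall-clock time of the enclosed block and print it."""
    result = {"seconds": None}
    start = time.perf_counter()
    try:
        yield result
    finally:
        # Record the duration even if the block raised an exception.
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.3f} seconds")

# Placeholder workload; a read_csv call would go here instead.
with timed("Time to sum a range") as t:
    total = sum(range(1_000_000))
```

After the block finishes, t["seconds"] holds the measured duration for later comparison.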
We will use three .csv files in the yellow_taxi_data folder in the current working directory.
import glob
path = './yellow_taxi_data'
files = glob.glob(path + "/*.csv")
files
['./yellow_taxi_data\\yellow_tripdata_2019-01.csv', './yellow_taxi_data\\yellow_tripdata_2019-02.csv', './yellow_taxi_data\\yellow_tripdata_2019-03.csv']
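The output above mixes forward slashes and backslashes because it was produced on Windows, and glob.glob does not guarantee any particular order. If consistent paths and a stable order matter, a pathlib-based variant is one option. The following is a sketch, with list_csv_files as a hypothetical helper name.

```python
from pathlib import Path

def list_csv_files(folder):
    """Return the .csv files in folder as a sorted list of POSIX-style path strings."""
    return sorted(p.as_posix() for p in Path(folder).glob("*.csv"))

# Same folder as in the article; the list is empty if the folder does not exist.
files = list_csv_files("./yellow_taxi_data")
print(files)
```

Sorting by name also puts the monthly files in chronological order, since the month is part of the file name.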
As in the last article, we first read the dataset with Pandas, but this time we measure the data loading time using Python's time module.
import pandas as pd
import time
start = time.time()
pd_df = pd.concat(map(pd.read_csv, files))
end = time.time()
pd_duration = end - start
print("Time to read data with pandas: {} seconds".format(round(pd_duration, 3)))
Time to read data with pandas: 48.634 seconds
pd_df.head()
 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 1 | 1.5 | 1 | N | 151 | 239 | 1 | 7.0 | 0.5 | 0.5 | 1.65 | 0.0 | 0.3 | 9.95 | NaN |
1 | 1 | 2019-01-01 00:59:47 | 2019-01-01 01:18:59 | 1 | 2.6 | 1 | N | 239 | 246 | 1 | 14.0 | 0.5 | 0.5 | 1.00 | 0.0 | 0.3 | 16.30 | NaN |
2 | 2 | 2018-12-21 13:48:30 | 2018-12-21 13:52:40 | 3 | 0.0 | 1 | N | 236 | 236 | 1 | 4.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 5.80 | NaN |
3 | 2 | 2018-11-28 15:52:25 | 2018-11-28 15:55:45 | 5 | 0.0 | 1 | N | 193 | 193 | 2 | 3.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 7.55 | NaN |
4 | 2 | 2018-11-28 15:56:57 | 2018-11-28 15:58:33 | 5 | 0.0 | 2 | N | 193 | 193 | 2 | 52.0 | 0.0 | 0.5 | 0.00 | 0.0 | 0.3 | 55.55 | NaN |
pd_df.shape
(22519712, 18)
It takes Pandas 48.634 seconds to load the dataset, even though it is only 1.92 GB, with 22,519,712 rows and 18 columns.
Next, we read the same data with Modin and check whether it is faster than Pandas.
import modin.pandas as md
start = time.time()
md_df = md.concat(map(md.read_csv, files))
end = time.time()
md_duration = end - start
print("Time to read data with modin: {} seconds".format(round(md_duration, 3)))
UserWarning: Ray execution environment not yet initialized. Initializing...
To remove this warning, run the following python code before doing dataframe operations:
import ray
ray.init(runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}})
2023-02-11 10:07:13,219 INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
Time to read data with modin: 30.416 seconds
There is a warning at the beginning, but it is not a problem. You can remove it by adding the code that the warning suggests. Compared with Pandas, the data loading time is reduced by over 18 seconds.
# for example, we show the first five rows
md_df.head()
 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 1 | 1.5 | 1 | N | 151 | 239 | 1 | 7.0 | 0.5 | 0.5 | 1.65 | 0.0 | 0.3 | 9.95 | NaN |
1 | 1 | 2019-01-01 00:59:47 | 2019-01-01 01:18:59 | 1 | 2.6 | 1 | N | 239 | 246 | 1 | 14.0 | 0.5 | 0.5 | 1.00 | 0.0 | 0.3 | 16.30 | NaN |
2 | 2 | 2018-12-21 13:48:30 | 2018-12-21 13:52:40 | 3 | 0.0 | 1 | N | 236 | 236 | 1 | 4.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 5.80 | NaN |
3 | 2 | 2018-11-28 15:52:25 | 2018-11-28 15:55:45 | 5 | 0.0 | 1 | N | 193 | 193 | 2 | 3.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 7.55 | NaN |
4 | 2 | 2018-11-28 15:56:57 | 2018-11-28 15:58:33 | 5 | 0.0 | 2 | N | 193 | 193 | 2 | 52.0 | 0.0 | 0.5 | 0.00 | 0.0 | 0.3 | 55.55 | NaN |
We use the first method introduced in Section 2 to specify the Ray engine, and check whether the speed of reading the same data can be improved.
import os
os.environ["MODIN_ENGINE"] = "ray" # Modin will use Ray
import ray
import modin.pandas as rd
start = time.time()
rd_df = rd.concat(map(rd.read_csv, files))
end = time.time()
rd_duration = end - start
print("Time to read data with modin and ray: {} seconds".format(round(rd_duration, 3)))
Time to read data with modin and ray: 15.785 seconds
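From the above result, we can clearly see that loading the dataset is much faster with Modin and Ray. Note that the previous Modin run also used Ray, as its warning shows, and its time includes starting the local Ray instance, so part of the difference may come from Ray already being initialized here.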
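A single wall-clock measurement can be affected by one-time costs such as engine start-up or operating system file caching. One common way to reduce this noise is to repeat the measurement and keep the best run. The sketch below illustrates the idea with a placeholder workload; best_of is a hypothetical helper, and repeating a 1.92 GB read several times is of course slower than timing it once.

```python
import time

def best_of(func, repeats=3):
    """Call func() repeats times and return (best, all) wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return min(durations), durations

# Placeholder workload; in this article's setting it would be the read_csv call.
best, runs = best_of(lambda: sum(range(100_000)), repeats=5)
print(f"best: {best:.4f} s over {len(runs)} runs")
```

Comparing the first run with the later ones also shows how much of the time was a one-time start-up cost.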
Next, let's specify the Dask engine and measure the speed of data reading. We can also limit the memory used, which can be done by removing the comment sign # from the two Client lines in the following code snippet.
import os
os.environ["MODIN_ENGINE"] = "dask" # Modin will use Dask
#from distributed import Client
#client = Client(memory_limit='8GB')
import modin.pandas as dd
start = time.time()
dd_df = dd.concat(map(dd.read_csv, files))
end = time.time()
dd_duration = end - start
print("Time to read data with modin and dask: {} seconds".format(round(dd_duration, 3)))
Time to read data with modin and dask: 24.772 seconds
The above result shows that Modin with Dask also speeds up loading, but it is slower than Modin with Ray.
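To put the four measurements side by side, the reported times can be turned into time saved and a speedup factor relative to Pandas. The numbers below are copied from the runs in this article; the speedup helper and the labels in the dictionary are illustrative names only.

```python
# Loading times in seconds, copied from the runs reported in this article.
timings = {
    "pandas": 48.634,
    "modin (default)": 30.416,
    "modin + ray": 15.785,
    "modin + dask": 24.772,
}

def speedup(baseline, other):
    """Return how many times faster `other` is than `baseline`."""
    return baseline / other

baseline = timings["pandas"]
for name, seconds in timings.items():
    saved = baseline - seconds
    factor = speedup(baseline, seconds)
    print(f"{name:16s} {seconds:7.3f} s  saved {saved:6.3f} s  speedup {factor:.2f}x")
```

With these figures, Modin with Ray comes out at roughly three times faster than Pandas, and Modin with Dask at roughly twice as fast.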
This article briefly introduced Modin and showed how Modin speeds up Pandas code by reading a 1.92 GB dataset with 22,519,712 rows. From the results, we can see that Pandas loads the data slowly even though the dataset is not really large, while Modin with Ray greatly reduces the loading time. The major advantage of Modin is that it speeds up Pandas by changing only the import statement, which is really helpful for manipulating and analyzing dataframes from 1 MB to 1 TB+.