VendorID	tpep_pickup_datetime	tpep_dropoff_datetime	passenger_count	trip_distance	RatecodeID	store_and_fwd_flag	PULocationID	DOLocationID	payment_type	fare_amount	extra	mta_tax	tip_amount	improvement_surcharge	total_amount	congestion_surcharge
0	1	2019-01-01 00:46:40	2019-01-01 00:53:20	1	1.5	1	N	151	239	1	7.0	0.5	0.5	1.65	0.3	9.95	NaN
1	1	2019-01-01 00:59:47	2019-01-01 01:18:59	1	2.6	1	N	239	246	1	14.0	0.5	0.5	1.00	0.3	16.30	NaN
2	2	2018-12-21 13:48:30	2018-12-21 13:52:40	3	0.0	1	N	236	236	1	4.5	0.5	0.5	0.00	0.3	5.80	NaN
3	2	2018-11-28 15:52:25	2018-11-28 15:55:45	5	0.0	1	N	193	193	2	3.5	0.5	0.5	0.00	0.3	7.55	NaN
4	2	2018-11-28 15:56:57	2018-11-28 15:58:33	5	0.0	2	N	193	193	2	52.0	0.0	0.5	0.00	0.3	55.55	NaN

In [11]:

pd_df.shape

Out[11]:

(22519712, 18)

Pandas takes 48.634 seconds to load the dataset, even though it is only 1.92 GB, with 22,519,712 rows and 18 columns.
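Every read in this section is timed with the same start/end pattern, so it can help to wrap that pattern once. The following is only a minimal sketch: the helper name `timed` and the `durations` dictionary are hypothetical and not part of the original notebook, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block and print the duration in the notebook's format.

    `label` and `results` are illustrative names, not part of the original code.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        duration = time.perf_counter() - start
        if results is not None:
            results[label] = duration  # keep the raw value for later comparison
        print("Time to read data with {}: {} seconds".format(label, round(duration, 3)))


# Stand-in workload; in the notebook the body would be the read_csv/concat call.
durations = {}
with timed("demo", durations):
    total = sum(range(1_000_000))
```

In the notebook, the body of the `with` block would be the `concat(map(read_csv, files))` call for whichever library is being measured.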

(3) Read data with Modin

Next, we read the same data with Modin and check whether it is faster than Pandas.

In [14]:

start = time.time()
md_df = md.concat(map(md.read_csv, files))
end = time.time()
md_duration = end - start
print("Time to read data with modin: {} seconds".format(round(md_duration, 3)))

UserWarning: Ray execution environment not yet initialized. Initializing...
To remove this warning, run the following python code before doing dataframe operations:

    import ray
    ray.init(runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}})

2023-02-11 10:07:13,219	INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265

Time to read data with modin: 30.416 seconds

A warning appears at the beginning, but it is harmless. You can suppress it by running the code it suggests before any dataframe operations. Compared with Pandas, the loading time drops by about 18 seconds (from 48.634 to 30.416 seconds).

In [20]:

# for example, we show the first five rows
md_df.head()

Out[20]:

	VendorID	tpep_pickup_datetime	tpep_dropoff_datetime	passenger_count	trip_distance	RatecodeID	store_and_fwd_flag	PULocationID	DOLocationID	payment_type	fare_amount	extra	mta_tax	tip_amount	improvement_surcharge	total_amount	congestion_surcharge
0	1	2019-01-01 00:46:40	2019-01-01 00:53:20	1	1.5	1	N	151	239	1	7.0	0.5	0.5	1.65	0.3	9.95	NaN
1	1	2019-01-01 00:59:47	2019-01-01 01:18:59	1	2.6	1	N	239	246	1	14.0	0.5	0.5	1.00	0.3	16.30	NaN
2	2	2018-12-21 13:48:30	2018-12-21 13:52:40	3	0.0	1	N	236	236	1	4.5	0.5	0.5	0.00	0.3	5.80	NaN
3	2	2018-11-28 15:52:25	2018-11-28 15:55:45	5	0.0	1	N	193	193	2	3.5	0.5	0.5	0.00	0.3	7.55	NaN
4	2	2018-11-28 15:56:57	2018-11-28 15:58:33	5	0.0	2	N	193	193	2	52.0	0.0	0.5	0.00	0.3	55.55	NaN

(4) Read data with Modin and Ray

We use the first method introduced in Section 2 to specify the Ray engine, and check if the speed of reading the same data can be improved.

In [15]:

import os
os.environ["MODIN_ENGINE"] = "ray"  # Modin will use Ray
import ray
import modin.pandas as rd

start = time.time()
rd_df = rd.concat(map(rd.read_csv, files))
end = time.time()
rd_duration = end - start
print("Time to read data with modin and ray: {} seconds".format(round(rd_duration, 3)))

Time to read data with modin and ray: 15.785 seconds

The result shows that loading is much faster with Modin and Ray: 15.785 seconds, roughly a third of the Pandas time.
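One caveat is worth checking when switching engines inside a single notebook session. As far as we understand, Modin reads `MODIN_ENGINE` when it is first imported and initialized, so assigning the variable after `modin.pandas` has already been imported may have no effect. The sketch below uses a hypothetical helper, `set_modin_engine`, that is not part of Modin; it relies only on the standard library and simply warns when the assignment comes too late.

```python
import os
import sys
import warnings


def set_modin_engine(name):
    """Hypothetical helper: set MODIN_ENGINE and warn if Modin is already imported.

    Returns True when the variable was set before modin.pandas was imported.
    """
    already_imported = "modin.pandas" in sys.modules
    if already_imported:
        # Assumption: an already-initialized Modin may ignore later changes.
        warnings.warn(
            "modin.pandas is already imported; MODIN_ENGINE={!r} may be ignored. "
            "Consider restarting the kernel before switching engines.".format(name)
        )
    os.environ["MODIN_ENGINE"] = name
    return not already_imported


fresh = set_modin_engine("ray")
```

If the helper warns, rerunning each engine's measurement in a fresh kernel is the safer way to compare them.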

(5) Read data with Modin and Dask

Next, let's specify the Dask engine and measure the reading speed. We can also limit the memory Dask uses by uncommenting the two commented lines in the following code snippet.

In [16]:

import os
os.environ["MODIN_ENGINE"] = "dask"  # Modin will use Dask

#from distributed import Client
#client = Client(memory_limit='8GB')
import modin.pandas as dd

start = time.time()
dd_df = dd.concat(map(dd.read_csv, files))
end = time.time()
dd_duration = end - start
print("Time to read data with modin and dask: {} seconds".format(round(dd_duration, 3)))

Time to read data with modin and dask: 24.772 seconds

The result shows that Modin with Dask also speeds up loading, but it is slower than Modin with Ray.
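To put the four measurements side by side, the short sketch below computes the seconds saved and the speedup factor relative to Pandas. The durations are copied from the outputs reported above; the function name `compare` and the row layout are illustrative choices, and the figures come from a single run each, so they should be read as rough indications only.

```python
# Durations in seconds, copied from the outputs reported in this section.
durations = {
    "pandas": 48.634,
    "modin (default)": 30.416,
    "modin + ray": 15.785,
    "modin + dask": 24.772,
}
baseline = durations["pandas"]


def compare(durations, baseline):
    """Return (name, seconds, seconds saved, speedup factor) for each entry."""
    rows = []
    for name, seconds in durations.items():
        saved = round(baseline - seconds, 3)     # absolute time saved vs. Pandas
        speedup = round(baseline / seconds, 2)   # how many times faster than Pandas
        rows.append((name, seconds, saved, speedup))
    return rows


for name, seconds, saved, speedup in compare(durations, baseline):
    print(f"{name:<16} {seconds:>7.3f} s  saved {saved:>6.3f} s  speedup {speedup:.2f}x")
```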

Summary

This article briefly introduced Modin and showed how it speeds up Pandas code by reading a 1.92 GB dataset with 22,519,712 rows. The results show that Pandas loads the data slowly (48.634 seconds) even though the dataset is not really large, while Modin with Ray cuts the loading time to 15.785 seconds. Modin's major advantage is that it speeds up Pandas by changing only the import statement, which makes it helpful for manipulating and analyzing dataframes from 1 MB to 1 TB+.
