Pandas 2.0 Handles and Processes Large Data More Efficiently with PyArrow
Pandas was initially built on NumPy, which helped make Pandas the popular library it is today. NumPy is an essential library for anyone working with numerical data in Python, providing a foundation for scientific computing, data analysis, and machine learning applications. However, it has some limitations when it comes to handling large data, such as memory constraints, limited support for out-of-core computing, and limited support for parallel processing.
Pandas 2.0.0 was officially released on April 3, 2023, a major release following the pandas 1 series. There are many changes and updates in Pandas 2.0, and one of the major updates is the new Apache Arrow backend for pandas data.
Pandas 2.0 integrates Arrow to improve the performance and scalability of data processing. Arrow is an open-source software library for in-memory data processing that provides a standardized way to represent and manipulate data across different programming languages and platforms.
By integrating Arrow, Pandas 2.0 is able to take advantage of the efficient memory handling and data serialization capabilities of Arrow. This allows Pandas to process and manipulate large datasets more quickly and efficiently than before.
Some of the key benefits of integrating Arrow with Pandas include:
- Improved performance: Arrow provides a more efficient way to handle and process data, resulting in faster computation times and reduced memory usage.
- Scalability: Arrow’s standardized data format allows for easy integration with other data processing tools, making it easier to scale data processing pipelines.
- Interoperability: Arrow provides a common data format that can be used across different programming languages and platforms, making it easier to share data between different systems.
Overall, integrating Arrow with Pandas 2.0 provides a significant performance boost and improves the scalability and interoperability of data processing workflows.
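As a quick illustration of what the Arrow integration looks like in practice, an existing NumPy-backed DataFrame can be converted to Arrow-backed dtypes. This is a minimal sketch; the sample data is made up:
import pandas as pd

# A small, made-up frame with the default NumPy/object backend.
df = pd.DataFrame({"city": ["NYC", "LA", "SF"], "trips": [10, 20, 30]})

# Convert the columns to Arrow-backed dtypes (requires pandas >= 2.0).
arrow_df = df.convert_dtypes(dtype_backend="pyarrow")

print(arrow_df.dtypes)
# city     string[pyarrow]
# trips     int64[pyarrow]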
Install Pandas 2.0
You should upgrade to Pandas 2.0 (or a later version) if you are still using a previous version. The release can be installed from PyPI:
pip install --upgrade "pandas>=2.0.0"
Or from conda-forge:
conda install -c conda-forge "pandas>=2.0.0"
For mamba users:
mamba install -c conda-forge "pandas>=2.0.0"
Install PyArrow
Install the latest version of PyArrow from conda-forge using Conda:
conda install -c conda-forge pyarrow
Install the latest version from PyPI:
pip install pyarrow
For mamba users:
mamba install -c conda-forge pyarrow
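To confirm the installation, you can check the installed versions (a quick sanity check):
import pandas as pd
import pyarrow as pa

print(pd.__version__)  # should be >= 2.0.0
print(pa.__version__)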
Import Required Libraries
First, let’s import the required libraries. Besides pandas, we also need the time module to measure how long each read takes.
import pandas as pd
import time
Read with Pandas Default NumPy Backend
We use Pandas’ default NumPy backend to read ‘yellow_tripdata_2019-01.csv’, a large dataset that can be downloaded from Kaggle.
start = time.time()
pd_df = pd.read_csv("./data/yellow_tripdata_2019-01.csv")
end = time.time()
pd_duration = end - start
print("Time to read data with pandas: {} seconds".format(round(pd_duration, 3)))
Time to read data with pandas: 13.185 seconds
pd_df.head()
This displays the first five rows of the DataFrame (output omitted here).
pd_df.shape
The result: (7667792, 18)
Thus, the dataset has 7,667,792 rows and 18 columns.
pd_df.dtypes
VendorID int64
tpep_pickup_datetime object
tpep_dropoff_datetime object
passenger_count int64
trip_distance float64
RatecodeID int64
store_and_fwd_flag object
PULocationID int64
DOLocationID int64
payment_type int64
fare_amount float64
extra float64
mta_tax float64
tip_amount float64
tolls_amount float64
improvement_surcharge float64
total_amount float64
congestion_surcharge float64
dtype: object
Read Data with PyArrow Engine
Next, we check how fast it is when we read the dataset with the PyArrow engine.
start = time.time()
arrow_eng_df = pd.read_csv("./data/yellow_tripdata_2019-01.csv", engine='pyarrow')
end = time.time()
arrow_eng_duration = end - start
print("Time to read data with pandas pyarrow engine: {} seconds".format(round(arrow_eng_duration, 3)))
Time to read data with pandas pyarrow engine: 2.591 seconds
It only takes 2.591 seconds, which is much faster than Pandas’ default NumPy backend.
arrow_eng_df.dtypes
VendorID int64
tpep_pickup_datetime datetime64[ns]
tpep_dropoff_datetime datetime64[ns]
passenger_count int64
trip_distance float64
RatecodeID int64
store_and_fwd_flag object
PULocationID int64
DOLocationID int64
payment_type int64
fare_amount float64
extra float64
mta_tax float64
tip_amount float64
tolls_amount float64
improvement_surcharge float64
total_amount float64
congestion_surcharge float64
dtype: object
If you compare the data types, there are also some differences. For example, ‘tpep_pickup_datetime’ and ‘tpep_dropoff_datetime’ are now ‘datetime64[ns]’ instead of ‘object’.
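For comparison, the default engine can produce the same datetime dtypes if you ask it to parse those columns explicitly. A minimal sketch using the same file:
# With the default C engine, parse the datetime columns explicitly.
df_parsed = pd.read_csv(
    "./data/yellow_tripdata_2019-01.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
)
# Both columns are now datetime64[ns], matching the PyArrow engine's output.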
Read Data with PyArrow Backend
Next, let’s see what happens when we read the dataset using the PyArrow dtype backend.
start = time.time()
df_pyarrow_type = pd.read_csv("./data/yellow_tripdata_2019-01.csv", dtype_backend="pyarrow", low_memory=False)
end = time.time()
df_pyarrow_type_duration = end - start
print("Time to read data with pandas pyarrow datatype backend: {} seconds".format(round(df_pyarrow_type_duration, 3)))
Time to read data with pandas pyarrow datatype backend: 24.407 seconds
However, this method takes much longer than the previous two methods.
df_pyarrow_type.dtypes
VendorID int64[pyarrow]
tpep_pickup_datetime string[pyarrow]
tpep_dropoff_datetime string[pyarrow]
passenger_count int64[pyarrow]
trip_distance double[pyarrow]
RatecodeID int64[pyarrow]
store_and_fwd_flag string[pyarrow]
PULocationID int64[pyarrow]
DOLocationID int64[pyarrow]
payment_type int64[pyarrow]
fare_amount double[pyarrow]
extra double[pyarrow]
mta_tax double[pyarrow]
tip_amount double[pyarrow]
tolls_amount double[pyarrow]
improvement_surcharge double[pyarrow]
total_amount double[pyarrow]
congestion_surcharge double[pyarrow]
dtype: object
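Although parsing took longer here, Arrow-backed columns are often much lighter in memory, especially for strings. A quick way to compare the two DataFrames (a rough sketch; exact numbers depend on your data):
# deep=True also counts the Python string objects behind 'object' columns.
numpy_mb = pd_df.memory_usage(deep=True).sum() / 1024**2
arrow_mb = df_pyarrow_type.memory_usage(deep=True).sum() / 1024**2
print("NumPy backend: {:.1f} MB".format(numpy_mb))
print("Arrow backend: {:.1f} MB".format(arrow_mb))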
Read Data with PyArrow Engine and Backend
Now we test the speed using both engine='pyarrow' and dtype_backend="pyarrow".
start = time.time()
df_pyarrow = pd.read_csv("./data/yellow_tripdata_2019-01.csv", engine='pyarrow', dtype_backend="pyarrow")
end = time.time()
df_pyarrow_duration = end - start
print("Time to read data with pandas pyarrow engine and datatype: {} seconds".format(round(df_pyarrow_duration, 3)))
Time to read data with pandas pyarrow engine and datatype: 2.409 seconds
It only takes 2.409 seconds, which is very fast.
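The Arrow backend can also speed up operations after loading, since string methods on ‘string[pyarrow]’ columns run in Arrow’s compute kernels instead of looping over Python objects. A rough timing sketch, following the same pattern as above (results will vary by machine):
# Time a string operation on the object-dtype column (NumPy backend).
start = time.time()
pd_df["store_and_fwd_flag"].str.contains("Y").sum()
print("object dtype: {} seconds".format(round(time.time() - start, 3)))

# The same operation on the Arrow-backed column.
start = time.time()
df_pyarrow["store_and_fwd_flag"].str.contains("Y").sum()
print("string[pyarrow] dtype: {} seconds".format(round(time.time() - start, 3)))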
Copy-on-Write (CoW) Improvements
Copy-on-Write (CoW) is a technique first introduced in Pandas 1.5.0 to optimize memory usage when copying data. With CoW, when data is copied, the copy only contains pointers to the original data, rather than a full copy of the data itself. This can save memory and improve performance, as the original data can be shared between multiple copies until one of the copies is modified.
In Pandas 2.0, there have been several improvements to CoW that further optimize memory usage and performance, especially when working with smaller datasets or individual columns of data.
pd.options.mode.copy_on_write = True
start = time.time()
df_copy = pd.read_csv("./data/yellow_tripdata_2019-01.csv")
end = time.time()
df_copy_duration = end - start
print("Time to read data with pandas copy on write: {} seconds".format(round(df_copy_duration, 3)))
Time to read data with pandas copy on write: 17.187 seconds
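To see what CoW actually changes, here is a minimal sketch with made-up data: a column selected from a DataFrame shares the parent’s buffer until it is written to, at which point a copy is made and the parent stays untouched.
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2, 3]})
subset = df["a"]        # no copy yet; subset and df share the same buffer
subset.iloc[0] = 100    # the write triggers a copy of just this column
print(df["a"].iloc[0])  # 1 -- the parent DataFrame is unchanged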
Conclusion
Pandas 2.0 can greatly increase the speed of data processing. Comparing Pandas’ default NumPy backend, the PyArrow engine, the PyArrow dtype backend, and the combination of the two, the tests show that combining the PyArrow engine with the PyArrow backend is the fastest option, and using the PyArrow engine alone also significantly speeds up data loading.
Originally published at https://medium.com/ on June 26, 2023.