First, make sure you have Vaex installed. To check the installed version in a Jupyter Notebook, you can simply run the following code snippet:
import vaex
vaex.__version__
{'vaex': '4.16.0', 'vaex-core': '4.16.1', 'vaex-viz': '0.5.4', 'vaex-hdf5': '0.14.1', 'vaex-server': '0.8.1', 'vaex-astro': '0.9.3', 'vaex-jupyter': '0.8.1', 'vaex-ml': '0.18.1'}
The output above shows that Vaex and its subpackages are installed. Otherwise, you can install it by running the following command in your terminal (or prefix it with ! to run it directly in a Jupyter notebook cell):
pip install vaex
Vaex supports a variety of data formats, including CSV, Parquet, HDF5, FITS, Arrow, and more. Here's how to read data using Vaex:
To read a CSV file with Vaex, you can use the vaex.open(), vaex.from_csv(), or vaex.read_csv() methods. In this example, we use 'yellow_tripdata_2019-01.csv', which can be downloaded from Kaggle.
vaex.open()
import vaex
# read csv with vaex.open()
vdf = vaex.open('./data/yellow_tripdata_2019-01.csv')
vdf.head()
# | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 1 | 1.5 | 1 | N | 151 | 239 | 1 | 7 | 0.5 | 0.5 | 1.65 | 0 | 0.3 | 9.95 | -- |
1 | 1 | 2019-01-01 00:59:47 | 2019-01-01 01:18:59 | 1 | 2.6 | 1 | N | 239 | 246 | 1 | 14 | 0.5 | 0.5 | 1 | 0 | 0.3 | 16.3 | -- |
2 | 2 | 2018-12-21 13:48:30 | 2018-12-21 13:52:40 | 3 | 0 | 1 | N | 236 | 236 | 1 | 4.5 | 0.5 | 0.5 | 0 | 0 | 0.3 | 5.8 | -- |
3 | 2 | 2018-11-28 15:52:25 | 2018-11-28 15:55:45 | 5 | 0 | 1 | N | 193 | 193 | 2 | 3.5 | 0.5 | 0.5 | 0 | 0 | 0.3 | 7.55 | -- |
4 | 2 | 2018-11-28 15:56:57 | 2018-11-28 15:58:33 | 5 | 0 | 2 | N | 193 | 193 | 2 | 52 | 0 | 0.5 | 0 | 0 | 0.3 | 55.55 | -- |
5 | 2 | 2018-11-28 16:25:49 | 2018-11-28 16:28:26 | 5 | 0 | 1 | N | 193 | 193 | 2 | 3.5 | 0.5 | 0.5 | 0 | 5.76 | 0.3 | 13.31 | -- |
6 | 2 | 2018-11-28 16:29:37 | 2018-11-28 16:33:43 | 5 | 0 | 2 | N | 193 | 193 | 2 | 52 | 0 | 0.5 | 0 | 0 | 0.3 | 55.55 | -- |
7 | 1 | 2019-01-01 00:21:28 | 2019-01-01 00:28:37 | 1 | 1.3 | 1 | N | 163 | 229 | 1 | 6.5 | 0.5 | 0.5 | 1.25 | 0 | 0.3 | 9.05 | -- |
8 | 1 | 2019-01-01 00:32:01 | 2019-01-01 00:45:39 | 1 | 3.7 | 1 | N | 229 | 7 | 1 | 13.5 | 0.5 | 0.5 | 3.7 | 0 | 0.3 | 18.5 | -- |
9 | 1 | 2019-01-01 00:57:32 | 2019-01-01 01:09:32 | 2 | 2.1 | 1 | N | 141 | 234 | 1 | 10 | 0.5 | 0.5 | 1.7 | 0 | 0.3 | 13 | -- |
Different from Pandas, df.head() in Vaex displays the first 10 rows by default.
vaex.from_csv()
vdf2 = vaex.from_csv('./data/yellow_tripdata_2019-01.csv')
vdf2.head()
# | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 1 | 1.5 | 1 | N | 151 | 239 | 1 | 7 | 0.5 | 0.5 | 1.65 | 0 | 0.3 | 9.95 | nan |
1 | 1 | 2019-01-01 00:59:47 | 2019-01-01 01:18:59 | 1 | 2.6 | 1 | N | 239 | 246 | 1 | 14 | 0.5 | 0.5 | 1 | 0 | 0.3 | 16.3 | nan |
2 | 2 | 2018-12-21 13:48:30 | 2018-12-21 13:52:40 | 3 | 0 | 1 | N | 236 | 236 | 1 | 4.5 | 0.5 | 0.5 | 0 | 0 | 0.3 | 5.8 | nan |
3 | 2 | 2018-11-28 15:52:25 | 2018-11-28 15:55:45 | 5 | 0 | 1 | N | 193 | 193 | 2 | 3.5 | 0.5 | 0.5 | 0 | 0 | 0.3 | 7.55 | nan |
4 | 2 | 2018-11-28 15:56:57 | 2018-11-28 15:58:33 | 5 | 0 | 2 | N | 193 | 193 | 2 | 52 | 0 | 0.5 | 0 | 0 | 0.3 | 55.55 | nan |
5 | 2 | 2018-11-28 16:25:49 | 2018-11-28 16:28:26 | 5 | 0 | 1 | N | 193 | 193 | 2 | 3.5 | 0.5 | 0.5 | 0 | 5.76 | 0.3 | 13.31 | nan |
6 | 2 | 2018-11-28 16:29:37 | 2018-11-28 16:33:43 | 5 | 0 | 2 | N | 193 | 193 | 2 | 52 | 0 | 0.5 | 0 | 0 | 0.3 | 55.55 | nan |
7 | 1 | 2019-01-01 00:21:28 | 2019-01-01 00:28:37 | 1 | 1.3 | 1 | N | 163 | 229 | 1 | 6.5 | 0.5 | 0.5 | 1.25 | 0 | 0.3 | 9.05 | nan |
8 | 1 | 2019-01-01 00:32:01 | 2019-01-01 00:45:39 | 1 | 3.7 | 1 | N | 229 | 7 | 1 | 13.5 | 0.5 | 0.5 | 3.7 | 0 | 0.3 | 18.5 | nan |
9 | 1 | 2019-01-01 00:57:32 | 2019-01-01 01:09:32 | 2 | 2.1 | 1 | N | 141 | 234 | 1 | 10 | 0.5 | 0.5 | 1.7 | 0 | 0.3 | 13 | nan |
vaex.read_csv()
You can use read_csv() alternatively if you prefer the same API as pandas.
vdf3 = vaex.read_csv('./data/yellow_tripdata_2019-01.csv')
# display the first 5 rows
vdf3.head(5)
# | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 1 | 1.5 | 1 | N | 151 | 239 | 1 | 7 | 0.5 | 0.5 | 1.65 | 0 | 0.3 | 9.95 | nan |
1 | 1 | 2019-01-01 00:59:47 | 2019-01-01 01:18:59 | 1 | 2.6 | 1 | N | 239 | 246 | 1 | 14 | 0.5 | 0.5 | 1 | 0 | 0.3 | 16.3 | nan |
2 | 2 | 2018-12-21 13:48:30 | 2018-12-21 13:52:40 | 3 | 0 | 1 | N | 236 | 236 | 1 | 4.5 | 0.5 | 0.5 | 0 | 0 | 0.3 | 5.8 | nan |
3 | 2 | 2018-11-28 15:52:25 | 2018-11-28 15:55:45 | 5 | 0 | 1 | N | 193 | 193 | 2 | 3.5 | 0.5 | 0.5 | 0 | 0 | 0.3 | 7.55 | nan |
4 | 2 | 2018-11-28 15:56:57 | 2018-11-28 15:58:33 | 5 | 0 | 2 | N | 193 | 193 | 2 | 52 | 0 | 0.5 | 0 | 0 | 0.3 | 55.55 | nan |
vaex.open() reads data lazily rather than keeping it all in RAM, whereas vaex.from_csv() and vaex.read_csv() use Pandas in the background to read the full dataset into memory, which can be more practical for smaller datasets.
Reading CSV files via Pandas can be slow. Apache Arrow provides a considerably faster way of reading such files, and Vaex conveniently exposes this functionality:
df_by_arrow = vaex.from_csv_arrow('./data/yellow_tripdata_2019-01.csv')
# display the first 3 rows
df_by_arrow.head(3)
# | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 1 | 1.5 | 1 | N | 151 | 239 | 1 | 7 | 0.5 | 0.5 | 1.65 | 0 | 0.3 | 9.95 | -- |
1 | 1 | 2019-01-01 00:59:47 | 2019-01-01 01:18:59 | 1 | 2.6 | 1 | N | 239 | 246 | 1 | 14 | 0.5 | 0.5 | 1 | 0 | 0.3 | 16.3 | -- |
2 | 2 | 2018-12-21 13:48:30 | 2018-12-21 13:52:40 | 3 | 0 | 1 | N | 236 | 236 | 1 | 4.5 | 0.5 | 0.5 | 0 | 0 | 0.3 | 5.8 | -- |
If the CSV file is too large to fit into RAM all at once, we can convert the data to HDF5.
df_chunk = vaex.from_csv('./data/yellow_tripdata_2019-01.csv', convert=True, chunk_size=5_000_000)
df_chunk.head(3)
# | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 1 | 1.5 | 1 | N | 151 | 239 | 1 | 7 | 0.5 | 0.5 | 1.65 | 0 | 0.3 | 9.95 | nan |
1 | 1 | 2019-01-01 00:59:47 | 2019-01-01 01:18:59 | 1 | 2.6 | 1 | N | 239 | 246 | 1 | 14 | 0.5 | 0.5 | 1 | 0 | 0.3 | 16.3 | nan |
2 | 2 | 2018-12-21 13:48:30 | 2018-12-21 13:52:40 | 3 | 0 | 1 | N | 236 | 236 | 1 | 4.5 | 0.5 | 0.5 | 0 | 0 | 0.3 | 5.8 | nan |
When reading the CSV in chunks, Vaex converts each chunk to a temporary HDF5 file on disk. After all temporary files are concatenated into a single HDF5 file, the temporary files are deleted. The chunk_size argument specifies the number of rows in each chunk.
Similar to reading CSV, we can read a JSON file using vaex.from_json, which has the same arguments and file-reading strategy as pandas.read_json. We use 'population_data.json' from GitHub; you can download it and read it locally, or, as with pandas, read it directly from the raw data link.
url = 'https://raw.githubusercontent.com/Malekai/Downloading-Data/master/population_data.json'
df_json = vaex.from_json(url)
df_json.head(3)
# | Country Name | Country Code | Year | Value |
---|---|---|---|---|
0 | Arab World | ARB | 1960 | 9.63881e+07 |
1 | Arab World | ARB | 1961 | 9.88825e+07 |
2 | Arab World | ARB | 1962 | 1.01474e+08 |
When the data is spread across multiple JSON files, we can use an approach similar to the one for large or multiple CSV files: first read each JSON file with the vaex.from_json method and convert it to either HDF5 or Arrow format, then use vaex.open to read the data lazily.
# Read a Parquet file with Vaex
df_parq = vaex.open('./data/yellow_tripdata_2022-01.parquet')
df_parq.head(5)
# | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2022-01-01 00:35:40 | 2022-01-01 00:53:29 | 2 | 3.8 | 1 | N | 142 | 236 | 1 | 14.5 | 3 | 0.5 | 3.65 | 0 | 0.3 | 21.95 | 2.5 | 0 |
1 | 1 | 2022-01-01 00:33:43 | 2022-01-01 00:42:07 | 1 | 2.1 | 1 | N | 236 | 42 | 1 | 8 | 0.5 | 0.5 | 4 | 0 | 0.3 | 13.3 | 0 | 0 |
2 | 2 | 2022-01-01 00:53:21 | 2022-01-01 01:02:19 | 1 | 0.97 | 1 | N | 166 | 166 | 1 | 7.5 | 0.5 | 0.5 | 1.76 | 0 | 0.3 | 10.56 | 0 | 0 |
3 | 2 | 2022-01-01 00:25:21 | 2022-01-01 00:35:23 | 1 | 1.09 | 1 | N | 114 | 68 | 2 | 8 | 0.5 | 0.5 | 0 | 0 | 0.3 | 11.8 | 2.5 | 0 |
4 | 2 | 2022-01-01 00:36:48 | 2022-01-01 01:14:20 | 1 | 4.3 | 1 | N | 68 | 163 | 1 | 23.5 | 0.5 | 0.5 | 3 | 0 | 0.3 | 30.3 | 2.5 | 0 |
# Read an HDF5 file with Vaex
df_hdf = vaex.open('./data/sample_names_1.hdf5')
df_hdf
# | name | age | city |
---|---|---|---|
0 | John | 17 | Edinburgh |
1 | Sally | 33 | Groningen |
Vaex supports streaming HDF5, Apache Arrow, Apache Parquet, and CSV files from Amazon's S3 and Google Cloud Storage. Here we follow an example from its documentation that streams an HDF5 file directly from S3.
df_s3 = vaex.open('s3://vaex/taxi/nyc_taxi_2015_mini.hdf5?anon=true')
df_s3.head(3)
# | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | payment_type | trip_distance | pickup_longitude | pickup_latitude | rate_code | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | fare_amount | surcharge | mta_tax | tip_amount | tolls_amount | total_amount |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | VTS | 2015-02-27 22:11:38.000000000 | 2015-02-27 22:22:51.000000000 | 5 | 1 | 2.26 | -74.0066 | 40.7075 | 1 | 0 | -74.0096 | 40.7346 | 10 | 0.5 | 0.5 | 2 | 0 | 13.3 |
1 | VTS | 2015-08-04 00:36:01.000000000 | 2015-08-04 00:47:11.000000000 | 1 | 1 | 5.13 | -74.0075 | 40.7052 | 1 | 0 | -73.9673 | 40.7552 | 16 | 0.5 | 0.5 | 3.46 | 0 | 20.76 |
2 | VTS | 2015-01-28 19:56:52.000000000 | 2015-01-28 20:03:27.000000000 | 1 | 2 | 1.89 | -73.9719 | 40.7629 | 1 | 0 | -73.9551 | 40.786 | 7.5 | 1 | 0.5 | 0 | 0 | 9.3 |
If a dataset is composed of multiple files, we can easily open them with Vaex. You can download multiple files, such as CSV, Parquet, or HDF5, from the links provided above. In this example, we will use the first three months of 'Yellow Taxi Trip Records (PARQUET)' in 2022.
If you want to open all files simultaneously, you can easily do as follows:
df_all = vaex.open('./data/*.parquet')
df_all.head(5)
# | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 1 | 1.5 | 1 | N | 151 | 239 | 1 | 7 | 0.5 | 0.5 | 1.65 | 0 | 0.3 | 9.95 | -- | -- |
1 | 1 | 2019-01-01 00:59:47 | 2019-01-01 01:18:59 | 1 | 2.6 | 1 | N | 239 | 246 | 1 | 14 | 0.5 | 0.5 | 1 | 0 | 0.3 | 16.3 | -- | -- |
2 | 2 | 2018-12-21 13:48:30 | 2018-12-21 13:52:40 | 3 | 0 | 1 | N | 236 | 236 | 1 | 4.5 | 0.5 | 0.5 | 0 | 0 | 0.3 | 5.8 | -- | -- |
3 | 2 | 2018-11-28 15:52:25 | 2018-11-28 15:55:45 | 5 | 0 | 1 | N | 193 | 193 | 2 | 3.5 | 0.5 | 0.5 | 0 | 0 | 0.3 | 7.55 | -- | -- |
4 | 2 | 2018-11-28 15:56:57 | 2018-11-28 15:58:33 | 5 | 0 | 2 | N | 193 | 193 | 2 | 52 | 0 | 0.5 | 0 | 0 | 0.3 | 55.55 | -- | -- |
Let's check its shape.
df_all.shape
(31590956, 19)
The dataset has 31,590,956 rows.
An alternative is the open_many method, which takes a list of the files you want to open. For example, here we open just two files.
file_list = ['./data/yellow_tripdata_2022-01.parquet',
'./data/yellow_tripdata_2022-02.parquet']
df_list = vaex.open_many(file_list)
df_list.head(5)
# | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2022-01-01 00:35:40 | 2022-01-01 00:53:29 | 2 | 3.8 | 1 | N | 142 | 236 | 1 | 14.5 | 3 | 0.5 | 3.65 | 0 | 0.3 | 21.95 | 2.5 | 0 |
1 | 1 | 2022-01-01 00:33:43 | 2022-01-01 00:42:07 | 1 | 2.1 | 1 | N | 236 | 42 | 1 | 8 | 0.5 | 0.5 | 4 | 0 | 0.3 | 13.3 | 0 | 0 |
2 | 2 | 2022-01-01 00:53:21 | 2022-01-01 01:02:19 | 1 | 0.97 | 1 | N | 166 | 166 | 1 | 7.5 | 0.5 | 0.5 | 1.76 | 0 | 0.3 | 10.56 | 0 | 0 |
3 | 2 | 2022-01-01 00:25:21 | 2022-01-01 00:35:23 | 1 | 1.09 | 1 | N | 114 | 68 | 2 | 8 | 0.5 | 0.5 | 0 | 0 | 0.3 | 11.8 | 2.5 | 0 |
4 | 2 | 2022-01-01 00:36:48 | 2022-01-01 01:14:20 | 1 | 4.3 | 1 | N | 68 | 163 | 1 | 23.5 | 0.5 | 0.5 | 3 | 0 | 0.3 | 30.3 | 2.5 | 0 |
We can easily construct a Vaex DataFrame from many other in-memory data representations, such as a pandas DataFrame, an Arrow table, NumPy arrays, a Python dict, etc.
In this example, we use pandas to read the 'top_six_economies.csv' from one of my GitHub repositories.
import pandas as pd
pd_df = pd.read_csv('https://raw.githubusercontent.com/shoukewei/data/main/data-pydm/top_six_economies.csv')
pd_df.head()
Unnamed: 0 | Country Name | Year | GDP (current US$) | GDP, PPP (current international $) | GDP per capita (current US$) | GDP growth (annual %) | Imports of goods and services (% of GDP) | Exports of goods and services (% of GDP) | Central government debt, total (% of GDP) | Total reserves (includes gold, current US$) | Unemployment, total (% of total labor force) (modeled ILO estimate) | Inflation, consumer prices (annual %) | Personal remittances, received (% of GDP) | Population, total | Population growth (annual %) | Life expectancy at birth, total (years) | Poverty headcount ratio at $1.90 a day (2011 PPP) (% of population) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 33 | United States | 1991 | 6.158129e+12 | 6.158129e+12 | 24342.258905 | -0.108265 | 10.125543 | 9.660905 | 44.061597 | 1.592729e+11 | 6.80 | 4.234964 | 0.021110 | 252981000.0 | 1.336261 | 75.365854 | 0.5 |
1 | 34 | United States | 1992 | 6.520327e+12 | 6.520327e+12 | 25418.990776 | 3.522441 | 10.241680 | 9.708915 | 46.050144 | 1.475259e+11 | 7.50 | 3.028820 | 0.027545 | 256514000.0 | 1.386886 | 75.617073 | 0.5 |
2 | 35 | United States | 1993 | 6.858559e+12 | 6.858559e+12 | 26387.293734 | 2.751781 | 10.497438 | 9.547180 | 48.246140 | 1.646202e+11 | 6.90 | 2.951657 | 0.026536 | 259919000.0 | 1.318680 | 75.419512 | 0.5 |
3 | 36 | United States | 1994 | 7.287236e+12 | 7.287236e+12 | 27694.853416 | 4.028793 | 11.162312 | 9.893147 | 47.353482 | 1.635906e+11 | 6.12 | 2.607442 | 0.026663 | 263126000.0 | 1.226296 | 75.619512 | 0.5 |
4 | 37 | United States | 1995 | 7.639749e+12 | 7.639749e+12 | 28690.875701 | 2.684217 | 11.814158 | 10.639224 | 47.209535 | 1.759954e+11 | 5.65 | 2.805420 | 0.028522 | 266278000.0 | 1.190787 | 75.621951 | 0.5 |
Then we convert the pandas DataFrame into a Vaex DataFrame using the vaex.from_pandas method.
vx_df = vaex.from_pandas(df=pd_df, copy_index=True)
vx_df.head(5)
# | Unnamed: 0 | Country Name | Year | GDP (current US$) | GDP, PPP (current international $) | GDP per capita (current US$) | GDP growth (annual %) | Imports of goods and services (% of GDP) | Exports of goods and services (% of GDP) | Central government debt, total (% of GDP) | Total reserves (includes gold, current US$) | Unemployment, total (% of total labor force) (modeled ILO estimate) | Inflation, consumer prices (annual %) | Personal remittances, received (% of GDP) | Population, total | Population growth (annual %) | Life expectancy at birth, total (years) | Poverty headcount ratio at $1.90 a day (2011 PPP) (% of population) | index |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 33 | United States | 1991 | 6.15813e+12 | 6.15813e+12 | 24342.3 | -0.108265 | 10.1255 | 9.66091 | 44.0616 | 1.59273e+11 | 6.8 | 4.23496 | 0.0211103 | 2.52981e+08 | 1.33626 | 75.3659 | 0.5 | 0 |
1 | 34 | United States | 1992 | 6.52033e+12 | 6.52033e+12 | 25419 | 3.52244 | 10.2417 | 9.70891 | 46.0501 | 1.47526e+11 | 7.5 | 3.02882 | 0.0275446 | 2.56514e+08 | 1.38689 | 75.6171 | 0.5 | 1 |
2 | 35 | United States | 1993 | 6.85856e+12 | 6.85856e+12 | 26387.3 | 2.75178 | 10.4974 | 9.54718 | 48.2461 | 1.6462e+11 | 6.9 | 2.95166 | 0.0265362 | 2.59919e+08 | 1.31868 | 75.4195 | 0.5 | 2 |
3 | 36 | United States | 1994 | 7.28724e+12 | 7.28724e+12 | 27694.9 | 4.02879 | 11.1623 | 9.89315 | 47.3535 | 1.63591e+11 | 6.12 | 2.60744 | 0.0266631 | 2.63126e+08 | 1.2263 | 75.6195 | 0.5 | 3 |
4 | 37 | United States | 1995 | 7.63975e+12 | 7.63975e+12 | 28690.9 | 2.68422 | 11.8142 | 10.6392 | 47.2095 | 1.75995e+11 | 5.65 | 2.80542 | 0.0285219 | 2.66278e+08 | 1.19079 | 75.622 | 0.5 | 4 |
When converting a Pandas DataFrame into a Vaex DataFrame, the copy_index argument determines whether the index column should also be included. This conversion is very helpful because Pandas can read data from a wide range of file formats; for example, we can use Pandas to extract data from a database and then convert it to a Vaex DataFrame.
Similarly, we can read an Arrow table as a Vaex DataFrame. In this example, let's first use pyarrow to read the yellow_tripdata_2019-01.csv file as an Arrow table.
import pyarrow.csv
arrow_table = pyarrow.csv.read_csv('./data/yellow_tripdata_2019-01.csv')
arrow_table
pyarrow.Table
VendorID: int64
tpep_pickup_datetime: timestamp[s]
tpep_dropoff_datetime: timestamp[s]
passenger_count: int64
trip_distance: double
RatecodeID: int64
store_and_fwd_flag: string
PULocationID: int64
DOLocationID: int64
payment_type: int64
fare_amount: double
extra: double
mta_tax: double
tip_amount: double
tolls_amount: double
improvement_surcharge: double
total_amount: double
congestion_surcharge: double
----
VendorID: [[1,1,2,2,2,...,1,1,1,1,1],[1,1,2,2,2,...,1,1,1,2,2],...]
tpep_pickup_datetime: [[2019-01-01 00:46:40,2019-01-01 00:59:47,2018-12-21 13:48:30,...],...]
tpep_dropoff_datetime: [[2019-01-01 00:53:20,2019-01-01 01:18:59,2018-12-21 13:52:40,...],...]
passenger_count: [[1,1,3,5,5,...,2,1,1,1,1],...]
trip_distance: [[1.5,2.6,0,0,0,...,4.6,4.1,5.5,3.7,1.2],...]
...
Next, we can easily convert the arrow table into a Vaex DataFrame.
vx_df = vaex.from_arrow_table(arrow_table)
vx_df.head(5)
# | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 1 | 1.5 | 1 | N | 151 | 239 | 1 | 7 | 0.5 | 0.5 | 1.65 | 0 | 0.3 | 9.95 | -- |
1 | 1 | 2019-01-01 00:59:47 | 2019-01-01 01:18:59 | 1 | 2.6 | 1 | N | 239 | 246 | 1 | 14 | 0.5 | 0.5 | 1 | 0 | 0.3 | 16.3 | -- |
2 | 2 | 2018-12-21 13:48:30 | 2018-12-21 13:52:40 | 3 | 0 | 1 | N | 236 | 236 | 1 | 4.5 | 0.5 | 0.5 | 0 | 0 | 0.3 | 5.8 | -- |
3 | 2 | 2018-11-28 15:52:25 | 2018-11-28 15:55:45 | 5 | 0 | 1 | N | 193 | 193 | 2 | 3.5 | 0.5 | 0.5 | 0 | 0 | 0.3 | 7.55 | -- |
4 | 2 | 2018-11-28 15:56:57 | 2018-11-28 15:58:33 | 5 | 0 | 2 | N | 193 | 193 | 2 | 52 | 0 | 0.5 | 0 | 0 | 0.3 | 55.55 | -- |
Besides, we can also easily convert NumPy arrays, a Python dict, or scalars into a Vaex DataFrame using vaex.from_arrays, vaex.from_dict, and vaex.from_scalars, respectively.
Vaex provides several options for exporting data, including CSV, Parquet, HDF5, FITS, Arrow, and more. Here's how to export data using Vaex:
df.export_csv('output.csv')
df.export_hdf5('output.hdf5')
df.export_arrow('output.arrow')
df.export_parquet('output.parquet')
Alternatively, we can simply use:
df.export('output.csv')
df.export('output.hdf5')
df.export('output.arrow')
df.export('output.parquet')
In the following examples, we will use the first set of methods because they clearly show which file format the data will be exported to. Below we discuss only three popular formats, namely CSV, Parquet, and HDF5.
To export a Vaex DataFrame to a CSV file, you can use the df.export_csv() method. For example, we export the df_json Vaex DataFrame above to a CSV file.
# Export a Vaex dataframe to a CSV file
df_json.export_csv('./output/population_data.csv')
The file named 'population_data.csv' is saved into the output folder in the current working directory. The Vaex DataFrame is exported in chunks, so that a large dataset that cannot fit in RAM can still be saved to disk. The chunk_size argument specifies the size of those chunks and has a default value of 1_000_000.
For larger DataFrames, it is better to export to CSV through the Apache Arrow backend, which provides better performance. For example, we can do that in the following way.
# Export a Vaex dataframe to a csv file via Arrow
df_json.export_csv_arrow('./output/population_data.csv')
To export a Vaex DataFrame to a Parquet file, you can use the df.export_parquet() method. In this example, we export the 'top_six_economies' Vaex DataFrame to a Parquet file.
# Export a Vaex dataframe to a Parquet file
vx_df.export_parquet('./output/top_six_economies.parquet')
You can specify the chunk size via the chunk_size argument; its default value is 1048576.
To export data from a Vaex DataFrame to HDF5, you can use the df.export_hdf5() method or just df.export(). Here, we export the 'yellow_tripdata_2019-01' DataFrame to an HDF5 file with the same file name. When exporting to HDF5, we can specify a particular group.
# Export data from a Vaex dataframe to a hdf5
df_by_arrow.export_hdf5('./output/yellow_tripdata_2019-01.hdf5', group='trip2019')
In this tutorial, we explored how to read and export data in various formats using Vaex. We also looked at how to easily construct a Vaex DataFrame from many other in-memory data representations, such as a pandas DataFrame, an Arrow table, NumPy arrays, a Python dict, etc. With its memory-mapped, columnar approach, Vaex offers an efficient and scalable solution for handling large datasets. By incorporating Vaex into your data pipeline, you can save time and reduce memory usage while still keeping the flexibility and ease of use of a Pandas-like API.