If you have datasets in a folder or from different places, you can easily use this method to read them and combine them into one. In the following example, I read only three .csv
files downloaded from the kaggle in the data
folder of my current working directory (CWD). Let's read and combine them.
In this article, we only need Pandas, so we import it first.
import pandas as pd
First, we create a list for these file, which are composed of their paths.
data_list = ['./data/yellow_tripdata_2019-01.csv',
'./data/yellow_tripdata_2019-02.csv',
'./data/yellow_tripdata_2019-03.csv'
]
data_list
['./data/yellow_tripdata_2019-01.csv', './data/yellow_tripdata_2019-02.csv', './data/yellow_tripdata_2019-03.csv']
We can use map()
method and concat()
to read the datasets from the list and then combine them into one DataFrame.
df = pd.concat(map(pd.read_csv, data_list))
df.head()
VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 1 | 1.5 | 1 | N | 151 | 239 | 1 | 7.0 | 0.5 | 0.5 | 1.65 | 0.0 | 0.3 | 9.95 | NaN |
1 | 1 | 2019-01-01 00:59:47 | 2019-01-01 01:18:59 | 1 | 2.6 | 1 | N | 239 | 246 | 1 | 14.0 | 0.5 | 0.5 | 1.00 | 0.0 | 0.3 | 16.30 | NaN |
2 | 2 | 2018-12-21 13:48:30 | 2018-12-21 13:52:40 | 3 | 0.0 | 1 | N | 236 | 236 | 1 | 4.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 5.80 | NaN |
3 | 2 | 2018-11-28 15:52:25 | 2018-11-28 15:55:45 | 5 | 0.0 | 1 | N | 193 | 193 | 2 | 3.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 7.55 | NaN |
4 | 2 | 2018-11-28 15:56:57 | 2018-11-28 15:58:33 | 5 | 0.0 | 2 | N | 193 | 193 | 2 | 52.0 | 0.0 | 0.5 | 0.00 | 0.0 | 0.3 | 55.55 | NaN |
This method is suitable when you need to read all files in the same format like .csv
in a folder in this example.
Besides pandas, we also need glob. Actually, glob is a module of Python's standard library. It is used to find the files and folders whose names match a specific pattern.
import glob
import pandas as pd
In this example, we find all the .csv
files in the yellow_taxi_data
folder in the CWD.
path = './yellow_taxi_data'
files = glob.glob(path + "/*.csv")
files
['./yellow_taxi_data\\yellow_tripdata_2019-01.csv', './yellow_taxi_data\\yellow_tripdata_2019-02.csv', './yellow_taxi_data\\yellow_tripdata_2019-03.csv']
In this step, we use a for loop
to make pandas read each CSV file into DataFrame, and then it forms a DataFrame list.
df_list = (pd.read_csv(file) for file in files)
Then we concatenate the list of all the DataFrame into a big Dataframe.
df = pd.concat(df_list, ignore_index=True)
df.head()
VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 1 | 1.5 | 1 | N | 151 | 239 | 1 | 7.0 | 0.5 | 0.5 | 1.65 | 0.0 | 0.3 | 9.95 | NaN |
1 | 1 | 2019-01-01 00:59:47 | 2019-01-01 01:18:59 | 1 | 2.6 | 1 | N | 239 | 246 | 1 | 14.0 | 0.5 | 0.5 | 1.00 | 0.0 | 0.3 | 16.30 | NaN |
2 | 2 | 2018-12-21 13:48:30 | 2018-12-21 13:52:40 | 3 | 0.0 | 1 | N | 236 | 236 | 1 | 4.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 5.80 | NaN |
3 | 2 | 2018-11-28 15:52:25 | 2018-11-28 15:55:45 | 5 | 0.0 | 1 | N | 193 | 193 | 2 | 3.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 7.55 | NaN |
4 | 2 | 2018-11-28 15:56:57 | 2018-11-28 15:58:33 | 5 | 0.0 | 2 | N | 193 | 193 | 2 | 52.0 | 0.0 | 0.5 | 0.00 | 0.0 | 0.3 | 55.55 | NaN |
map()
method instead of loop
¶To read multiple datasets from a folder, we can also use map()
method instead of loop
. This method is a combination methods used in sections 1 and 2.
path = './yellow_taxi_data'
df = pd.concat(map(pd.read_csv, glob.glob(path + "/*.csv")))
df.head()
VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 1 | 1.5 | 1 | N | 151 | 239 | 1 | 7.0 | 0.5 | 0.5 | 1.65 | 0.0 | 0.3 | 9.95 | NaN |
1 | 1 | 2019-01-01 00:59:47 | 2019-01-01 01:18:59 | 1 | 2.6 | 1 | N | 239 | 246 | 1 | 14.0 | 0.5 | 0.5 | 1.00 | 0.0 | 0.3 | 16.30 | NaN |
2 | 2 | 2018-12-21 13:48:30 | 2018-12-21 13:52:40 | 3 | 0.0 | 1 | N | 236 | 236 | 1 | 4.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 5.80 | NaN |
3 | 2 | 2018-11-28 15:52:25 | 2018-11-28 15:55:45 | 5 | 0.0 | 1 | N | 193 | 193 | 2 | 3.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 7.55 | NaN |
4 | 2 | 2018-11-28 15:56:57 | 2018-11-28 15:58:33 | 5 | 0.0 | 2 | N | 193 | 193 | 2 | 52.0 | 0.0 | 0.5 | 0.00 | 0.0 | 0.3 | 55.55 | NaN |
This article introduces 3 convenient methods to read and concatenate multiple datasets into a Dataframe using Python Pandas. However, Pandas will become very slow when the dataset files are many and large, because Pandas is not developed for big data analysis. In the future, we discuss the topics of big data analysis using these datasets.