Selecting the columns needed while reading the dataset will save the memory
In data analysis and manipulation tasks, it’s often necessary to work with specific columns of a dataset rather than the entire dataset. Pandas, a popular data analysis library in Python, provides powerful capabilities for selecting and manipulating columns in a DataFrame. One of its main features is the ability to read and write data from various file formats, including CSV files. In this article, we will explore how to select specific columns while reading a CSV file using pandas.
Reading the data
To read a CSV file with Pandas, you can use the read_csv()
function. In the following example, we will read the apple_share_price.csv
dataset from a GitHub repository.
import pandas as pd
url = 'https://raw.githubusercontent.com/NourozR/Stock-Price-Prediction-LSTM/master/apple_share_price.csv'
df = pd.read_csv(url)<br>df.head()
By default, this will read the entire dataset into a Pandas DataFrame. When you want only certain columns, you might select them by their names or index. The methods on selecting (or filtering) columns have been discussed in this previous article.
However, often times you might want to select only certain columns while reading the dataset. To do this, you can use the usecols
parameter of the read_csv()
function. By leveraging the usecols
parameter in the read_csv()
function, we can conveniently specify the columns we want to include in the resulting DataFrame.
Selecting Columns by Names While Reading Data
You can specify the columns you want to use by their names while reading the dataset. For example, if you want to use the first and third columns of your dataset, you can do so like this:
import pandas as pd
url = 'https://raw.githubusercontent.com/NourozR/Stock-Price-Prediction-LSTM/master/apple_share_price.csv'
df = pd.read_csv(url, usecols=['Date', 'High'])
df.head()
In the example above, we have specified that we only want to load Date
and High
. If you have a large dataset with many columns, this can be a handy way to reduce the amount of memory used by your program.
Selecting Columns by Index While Reading Data
You can also specify the columns you want to use by their index position. For example, if you want to use the first and third columns of your dataset, you can also do so as follows:
import pandas as pd
url = 'https://raw.githubusercontent.com/NourozR/Stock-Price-Prediction-LSTM/master/apple_share_price.csv'
df = pd.read_csv(url, usecols=[0,2])
df.head()
This technique can be especially useful when working with large datasets where using column names would be more time-consuming due to overhead.
Filtering Columns and Rows While Reading Data
As an additional point to remember, you may also need to filter rows based on some condition while reading CSV data. You could do this with the combo of filtering rows (index slices) and selecting columns (by index or name):
import pandas as pd
url = 'https://raw.githubusercontent.com/NourozR/Stock-Price-Prediction-LSTM/master/apple_share_price.csv'
# Read only rows between 10 to 20, and select first 3 columns
df = pd.read_csv(url, usecols=[0,1,2], skiprows=range(1,10), nrows=10)
df
In this example, we selected the rows starting from the 10th to 20th using the skiprows
parameter. And, specified the number of rows to include after skipping the initial (filtered) rows with nrows
parameter.
Conclusion
Selecting specific columns while reading a CSV file using pandas is a common requirement in data analysis and manipulation tasks. With the usecols parameter of the read_csv()
function, pandas offers a straightforward approach to directly include only the desired columns in the resulting DataFrame. By following the steps outlined in this tutorial, you can effectively read a CSV file and immediately access the columns of interest. Pandas’ flexibility and functionality make it a valuable tool for working with structured data, and mastering column selection techniques is an essential skill for any data analyst or scientist.
Originally published at https://medium.com/@shouke.wei on June 12, 2023.