Use real-world dataset to display how to filter dataset with different methods
In the last two articles, I displayed how to slice a dataset and select the slices, how to sort the dataset using a real-world dataset with Python Pandas. In this article, we will continue using the same dataset to learn how to filter the dataset.
Let’s read the dataset first before starting the topic.
# Load the required packages
import pandas as pd
# Read the data
df = pd.read_csv('./data/gdp_china_renamed.csv')
# diplay the first 5 rows
1. Filter Method
DataFrame.filter(items=None, like=None, regex=None, axis=None)is used to subset the DataFrame rows or columns according to the specified index labels.
(1) Filter columns
(i) select columns by name
items parameter is used to filter columns by name. For example, filter the three columns as follows.
(ii) select columns by regular expression
We can filter the columns that contain a character or characters using
regex. For example, we filter the columns contains
regex='p$' will select the columns that contain
p at the end/
(iii) select columns by like parameter
regex parameter, we can use
like parameter to filter columns that contain a character or characters.
(3) filter rows
For example, filter rows whose index contains 4.
(i) 4 is at the end
(ii) 4 can be anywhere
2. filter rows by conditions
For example, filter rows in which pop is greater than 10.
Boolean operator used to subset the data:
>= greater or equal;
<= less or equal;
!= not equal;
Besides, we can filter string columns. For example, we select the DataFrame in which
gdprstarts with string’F’.
3. filter by multiple conditions
(1) Using loc with multiple conditions
For example, we filter the rows where
pop is greater than 10,
gdp is greater than 6,
gdpr starts with string ‘F’.
(2) Using NumPy
where with multiple conditions
We can achieve the above results using the Numpy
import numpy as np
idx = np.where((df['pop']>10)&(df['gdp']>6)&(df['gdpr'].str.startswith('F')))
(array([13, 14, 15, 16], dtype=int64),)
# Dsiply it as a DataFrame
Query with multiple conditions
Now, let’s use pandas
query() function to do the above task as follows.
df.query("pop>10 & gdp>6&gdpr.str.startswith('F')")
eval with multiple conditions
Besides, we can also use Pandas
eval to filter the DataFrame.
df[df.eval("pop>10 & gdp>6&gdpr.str.startswith('F')")]
4. Online course
If you are interested in learning data analysis in details, you are welcome to enroll one of my course:
Master Python Data Analysis and Modelling Essentials