Use real-world dataset to display how to filter dataset with different methods
In the last two articles, I displayed how to slice a dataset and select the slices, how to sort the dataset using a real-world dataset with Python Pandas. In this article, we will continue using the same dataset to learn how to filter the dataset.
Let’s read the dataset first before starting the topic.
# Load the required packages
import pandas as pd
# Read the data
df = pd.read_csv('./data/gdp_china_renamed.csv')
# diplay the first 5 rows
df.head()
1. Filter Method
DataFrame.filter(items=None, like=None, regex=None, axis=None)
is used to subset the DataFrame rows or columns according to the specified index labels.
(1) Filter columns
(i) select columns by name
The items
parameter is used to filter columns by name. For example, filter the three columns as follows.
df.filter(items=['prov','year','gdp'])
df.head()
(ii) select columns by regular expression
We can filter the columns that contain a character or characters using regex
. For example, we filter the columns contains p
.
df.filter(regex='p',axis=1)
However, regex='p$'
will select the columns that contain p
at the end/
df.filter(regex='p
(iii) select columns by like parameter
Similar with regex
parameter, we can use like
parameter to filter columns that contain a character or characters.
df.filter(like='op',axis=1)
(3) filter rows
For example, filter rows whose index contains 4.
(i) 4 is at the end
df.filter(regex='4
(ii) 4 can be anywhere
df.filter(like='4',axis=0)
2. filter rows by conditions
For example, filter rows in which pop is greater than 10.
df[df['pop']>10]
Boolean operator used to subset the data:
> greater;
>= greater or equal;
< less;
<= less or equal;
== equal;
!= not equal;
Besides, we can filter string columns. For example, we select the DataFrame in which gdpr
starts with string’F’.
df[df['prov'].str.startswith('G')]
3. filter by multiple conditions
(1) Using loc with multiple conditions
For example, we filter the rows where pop
is greater than 10, gdp
is greater than 6, gdpr
starts with string ‘F’.
df.loc[(df['pop']>10)&(df['gdp']>6)&(df['gdpr'].str.startswith('F'))]
(2) Using NumPy where
with multiple conditions
We can achieve the above results using the Numpy where()
function.
import numpy as np
idx = np.where((df['pop']>10)&(df['gdp']>6)&(df['gdpr'].str.startswith('F')))
idx
(array([13, 14, 15, 16], dtype=int64),)
# Dsiply it as a DataFrame
df.loc[idx]
(3) Using Query
with multiple conditions
Now, let’s use pandas query()
function to do the above task as follows.
df.query("pop>10 & gdp>6&gdpr.str.startswith('F')")
(4) Using eval
with multiple conditions
Besides, we can also use Pandas eval
to filter the DataFrame.
df[df.eval("pop>10 & gdp>6&gdpr.str.startswith('F')")]
4. Online course
If you are interested in learning data analysis in details, you are welcome to enroll one of my course:
Master Python Data Analysis and Modelling Essentials