Practical Methods to Filter Dataset with Python Pandas

Use real-world dataset to display how to filter dataset with different methods

In the last two articles, I displayed how to slice a dataset and select the slices, how to sort the dataset using a real-world dataset with Python Pandas. In this article, we will continue using the same dataset to learn how to filter the dataset.

Let’s read the dataset first before starting the topic.

# Load the required packages
import pandas as pd

# Read the data
df = pd.read_csv('./data/gdp_china_renamed.csv')

# diplay the first 5 rows
df.head()
png

1. Filter Method

DataFrame.filter(items=None, like=None, regex=None, axis=None)is used to subset the DataFrame rows or columns according to the specified index labels.

(1) Filter columns

(i) select columns by name

The items parameter is used to filter columns by name. For example, filter the three columns as follows.

df.filter(items=['prov','year','gdp'])
df.head()
png

(ii) select columns by regular expression

We can filter the columns that contain a character or characters using regex. For example, we filter the columns contains p.

df.filter(regex='p',axis=1)
png

However, regex='p$' will select the columns that contain p at the end/

df.filter(regex='p
png

(iii) select columns by like parameter

Similar with regex parameter, we can use like parameter to filter columns that contain a character or characters.

df.filter(like='op',axis=1)
png

(3) filter rows

For example, filter rows whose index contains 4.

(i) 4 is at the end

df.filter(regex='4
png

(ii) 4 can be anywhere

df.filter(like='4',axis=0)
png

2. filter rows by conditions

For example, filter rows in which pop is greater than 10.

df[df['pop']>10]
png

Boolean operator used to subset the data:

>   greater; 
>= greater or equal;
< less;
<= less or equal;
== equal;
!= not equal;

Besides, we can filter string columns. For example, we select the DataFrame in which gdprstarts with string’F’.

df[df['prov'].str.startswith('G')]
png

3. filter by multiple conditions

(1) Using loc with multiple conditions

For example, we filter the rows where pop is greater than 10, gdp is greater than 6, gdpr starts with string ‘F’.

df.loc[(df['pop']>10)&(df['gdp']>6)&(df['gdpr'].str.startswith('F'))]
png

(2) Using NumPy where with multiple conditions

We can achieve the above results using the Numpy where()function.

import numpy as np

idx = np.where((df['pop']>10)&(df['gdp']>6)&(df['gdpr'].str.startswith('F')))
idx
(array([13, 14, 15, 16], dtype=int64),)
# Dsiply it as a DataFrame
df.loc[idx]
png

(3) Using Query with multiple conditions

Now, let’s use pandas query() function to do the above task as follows.

df.query("pop>10 & gdp>6&gdpr.str.startswith('F')")
png

(4) Using eval with multiple conditions

Besides, we can also use Pandas eval to filter the DataFrame.

df[df.eval("pop>10 & gdp>6&gdpr.str.startswith('F')")]
png

4. Online course

If you are interested in learning data analysis in details, you are welcome to enroll one of my course:

Master Python Data Analysis and Modelling Essentials

 

Bookmark
Please login to bookmarkClose
0 - 0

Thank You For Your Vote!

Sorry You have Already Voted!

Please follow and like me:

Leave a Reply

Your email address will not be published. Required fields are marked *