Practical Methods for Dataset Slice and Slice Selection with Python Pandas

Use a real-world dataset to display different methods for dataset slice and select

In this article, we will learn how to slice a Dataframe and select the columns and the rows and the slice using Python Pandas package. We will use the cleaned dataset gdp_china_outlier_treated.csv, which can be downloaded from my GitHub repository by clicking this link. I suggest you creating a data folder in your working directory, and moving the downloaded dataset in this folder.

If you are not familiar with the process of data cleaning, you are recommended reading the articles on data cleaning process, including variables rename, missing values detection and missing values imputation, as well as outliers detection and treatment.

First, let’s read the dataset from the data folder in the working directory in Pandas’ DataFrame, or read the dataset from GitHub directly using the method introduced in a previous article.

# Load the required packages
import pandas as pd

# Read the data
df = pd.read_csv('./data/gdp_china_outlier_treated.csv')

# diplay the first 5 rows
df.head()
png

1. Selecting columns

let’s see how to select certain columns.

(1) One column

Method 1: Use column name to subset the data frame

gdp = df[['gdp']]
gdp.head()
png

Method 2: Use the column name as an attribute

gdp = df.gdp
gdp.head()

However, sometime this method causes error. For example:

pop = df.pop
pop.head()

The reason for this error is that it conflicts with the built-in function of pop() in Python.

(2) Select multiple columns

Let’s see how to select multiple columns.

Method 1: using a name list

df_new = df[['prov','year','gdp']]
df_new.head()
png

Method 2: using drop() function

We can use drop() when selecting multiple columns, except one or certain rows.

(i) Select all columns except one

We select all columns except only year.

df_new = df.drop('year',axis=1)
df_new.head()
png

(ii) select all columns except two or more

We select all columns except year and gdpr.

df_new = df.drop(['year','gdpr'],axis=1)
df_new.head()
png

3. Select rows

(1) select first N rows

df.head(6)
png

(2) select last N rows

df.tail(6)
png

(3) randomly select N rows

df.sample(7)
png

(4) using row index

For example, select the first 5 rows

df[0:5]
# equal to df.head()
png

Select last five rows.

df[-5:]
#equal to df.tail()
png

(5) using drop() function

For example, drop rows at the indexes of 0 and 3.

df_new = df.drop([0,3],axis=0)
df_new.head()
png

(6) using .loc indexer

For example, select the first to the seventh rows.

df.loc[0:6]
png

(7) using .iloc indexer

To select the first 7 rows, we can do the following:

df.iloc[0:7]
png

4. Slicing rows and columns

(1) using .loc

We can use the row index and the column names.

df.loc[0:6,['prov','year','gdp']]
png
# all rows
df.loc[:,['prov','year','gdp']]
png
# exclude one column
df.loc[:,df.columns!='year']
png

(2) using .iloc

# continous columns
df.iloc[0:6,0:4]
png
# discontinuous columns
df.iloc[0:6,[0,2,4]]
png

5. Difference between loc and iloc

The main difference between loc indexer iloc indexer are as follows:

  • loc: label-based, which specify rows and columns based on their row and column labels.
  • iloc: integer position-based, which specify rows and columns by their integer position values (0-based integer position).

6. Online course

If you are interested in learning data analysis in details, you are welcome to enroll one of my course:

Master Python Data Analysis and Modelling Essentials

Bookmark
Please login to bookmarkClose
0 - 0

Thank You For Your Vote!

Sorry You have Already Voted!

Please follow and like me:

Leave a Reply

Your email address will not be published. Required fields are marked *