Use different methods to detect missing values in a dataset using Python Pandas library

This article is the Part IV of ** Data Analysis Series**, which includes the following parts. I suggest you read from the first part so that you can better understand the whole process.

*Part I: How to Read Dataset from GitHub and Save it using Pandas**Part II: Convenient Methods to Rename Columns of Dataset with Pandas in Python**Part III: Different Methods to Access General Information of A Dataset with Python Pandas**Part IV: Different Methods to Easily Detect Missing Values in Python**Part V: Different Methods to Impute Missing Values of Datasets with Python Pandas**Part VI: Different Methods to Quickly Detect Outliers of Datasets with Python Pandas**Part VII: Different Methods to Treat Outliers of Datasets with Python Pandas**Part VIII: Convenient Methods to Encode Categorical Variables in Python*

In this article, I will display different methods to easily detect the missing values in your datasets using Pandas in Python. In the previous articles, we talked about ** How to Read Dataset from GitHub and Save it using Pandas** and

*Convenient Methods to Rename Columns of Dataset with Pandas***I suggest you reading these two articles first because we will use the precessed dataset in the articles. If you are familiar with these, and you also have your own dataset to use, you can skip these two articles. This article will follow the following working process:**

*.*- Generally check if there is any missing value, if so
- Calculate the frequency of missing (or how many)
- Find the locations of the missing values are (i.e. their indices)

#### 1. Preparation to Start

#### (1) import required packages

`import numpy as np`

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

#### (2) read the dataset

In the last post, we have renamed the columns of the dataset to the abbreviations of their original names and saved the renamed dataset `gdp_china_renamed.csv`

into the local working directory. Let’s import the dataset now.

`df = pd.read_csv('./data/gdp_china_renamed.csv')`

# display columns

df.columns

#### 2. Different Methods to Detect Missing Values

First, we just check generally whether there are any missing using some handy methods.

#### 2.1 Roughly check if there are any missing

There are different methods to check generally whether there are any missing using some handy methods. In this article, three practical methods will be introduced as follows.

#### (1) Using info( ) function

`df.info()`

The result looks as follows:

The above results show that pop might have 2 missing values since it has only 93 values in the dataset while other variables have 95 values. I say “might” because the total number of each variable are calculated based on non-null values. That is to say, the values might be zeros rather than missing in many cases. For this case and dataset, however, population (pop) cannot be zero, so we can conclude that pop has two missing values.

#### (2) Using simple line plot

We can also use a simple line plot to see if there are any missing. Whereas, the limitation of this method is not suitable for large and noisy data.

`cols = ['year', 'gdp', 'pop', 'finv', 'trade', 'fexpen','uinc']`

plt.figure(figsize=(15,15))

for i, each in enumerate(cols):

plt.subplot(len(cols),2,i+1)

plt.plot(df[each])

plt.title(each)

plt.tight_layout()

plt.show()

The plot results clearly display that pop has missing values.

#### (3) `isna()`

method

The third method here is to use Pandas’ `isna().any()`

to detect which variables have missings.

`df.isna().any()`

#### (4) Using heatmap

Besides, the null-value heatmap with seaborn can be used to clearly illustrate if there is any missing in a dataset.

`sns.heatmap(df.isna(),cbar=False) `

From the heatmap, we also can see there are 2 missings in the population data.

#### 2.2 The frequency of Missings (absolute)

Next, let’s see how to calculate the number and frequency of missings though some methods above can also roughly show how many missings in the data.

#### (1) use `isna().sum().sum()`

`missings = df.isna().sum().sum()`

print(f'The number of missing values: {missings}')

The result is:

The number of missing values: 2

#### (2) use `isna().sum()/(len(data))`

`df.isna().sum()/(len(df))*100`

We can print a readable result on the columns with lowest and highest amounts of missing frequencies in the following way.

`mis_freq = df.isna().sum()/len(df)*100`

print("Column with the lowest amount of missings contains {} % missings.".format(mis_freq.min()))

print("Column with the highest amount of missings contains {} % missings.".format(mis_freq.max()))

#### 2.3 Determine positions of missings

Let’s further detect where the missing are, i.e. their indices.

#### (1) Determine the columns of the missings.

`m_col = df.loc[:, df.isnull().any()].columns`

print(f'The columns with missings is/are {m_col[0]}.')

The columns with missings is/are pop.

#### (2) Determine row index of the missings

There are different ways to find the indices of the missings .

*(I) use Pandas index*

`df[df['pop'].isna()].index`

Int64Index([4, 23], dtype=’int64′)

*(II) use NumPy where*

`np.where(df['pop'].isna())`

(array([ 4, 23], dtype=int64),)

*(III) use Pandas DataFrame table*

`df[df['pop'].isna()]`

All the results from the above methods show that the row indices of 4 and 23 contain missing values. Finally, we found that the population column has 2 missing values with the row index of 4 and 23, respectively. Then you can replace the missings in the next step, which I will probably discuss in the future.

### 3. Online course

If you are interested in learning Python data analysis in details, you are welcome to enroll one of my courses:

Master Python Data Analysis and Modelling Essentials