Convenient Methods to Encode Categorical Variables in Python

Easy and flexible methods to encode categorical variables for further quantitative analysis

In many cases, we need to convert categorical or string variables into numbers in order to analyze the data quantitatively or develop a model. There are many ways to do that. In this article, I will show you some easy but practical methods to encode columns with categorical data or strings into numerical values.

This article is Part VII of the Data Analysis Series. I suggest you read from the first part so that you can better understand the whole process.

First, let’s read the dataset whose outliers were treated in the previous part and saved as gdp_china_outlier_treated.csv.

# import required packages
import pandas as pd

# read data
df = pd.read_csv('./data/gdp_china_outlier_treated.csv')
# display the first 5 rows
df.head()

1. Factorize Methods

Pandas provides a very convenient factorize function, pd.factorize, which encodes the values of one or more string columns in a pandas DataFrame as integer codes.

(1) Factorize one column

Suppose we have only one string column, or we only want to factorize one column. I will work on a copy of the dataset in this example; otherwise the original dataset would be changed, and we still need the original for the other methods.

df_copy = df.copy()
df_copy['prov'] = pd.factorize(df_copy['prov'])[0]

After this step, the prov column holds integer codes instead of province names.
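pd.factorize returns two things: the integer codes and the unique values they stand for. The snippet below is a minimal sketch on a small made-up Series (the values are for illustration only, not rows from this article's dataset) that keeps the uniques so the codes can be turned back into labels later.

```python
import pandas as pd

# Toy data for illustration only (not the article's dataset)
prov = pd.Series(['Guangdong', 'Jiangsu', 'Guangdong', 'Henan'])

# factorize returns (codes, uniques); codes follow the order of first appearance
codes, uniques = pd.factorize(prov)
print(codes.tolist())  # [0, 1, 0, 2]
print(list(uniques))   # ['Guangdong', 'Jiangsu', 'Henan']

# Indexing the uniques with the codes recovers the original labels
decoded = uniques[codes]
print(list(decoded))   # ['Guangdong', 'Jiangsu', 'Guangdong', 'Henan']
```

Keeping uniques around is useful whenever you need to report results with the original labels.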


(2) Factorize more columns

In most cases, a dataset has more than one categorical or string column. In our dataset here, there are two string columns.

df_copy2 = df.copy()
df_copy2[['prov', 'gdpr']] = df[['prov', 'gdpr']].apply(lambda x: pd.factorize(x)[0])

From the above result, we can see the prov and gdpr columns have been factorized.
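One thing to keep in mind is that factorize numbers the labels in the order they first appear, which may not match a meaningful order such as a ranking. As an alternative (not the method used above), an ordered pd.Categorical assigns codes in the order you declare. The sketch below uses a made-up Series and assumes the ranking labels are First to Fifth.

```python
import pandas as pd

# Toy data for illustration only
gdpr = pd.Series(['Third', 'First', 'Second', 'First'])

# factorize numbers labels by first appearance, so 'Third' gets code 0 here
by_appearance = pd.factorize(gdpr)[0]
print(by_appearance.tolist())  # [0, 1, 2, 1]

# An ordered Categorical assigns codes by the declared order instead
order = ['First', 'Second', 'Third', 'Fourth', 'Fifth']
by_rank = pd.Categorical(gdpr, categories=order, ordered=True).codes
print(by_rank.tolist())  # [2, 0, 1, 0]
```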

(3) Factorize all columns

In a few cases, we may need to factorize all columns. Let’s create one such case.

# create DataFrame
df_task = pd.DataFrame({'team': ['A', 'B', 'C', 'D', 'E'],
                        'task': ['a', 'b', 'b', 'a', 'c'],
                        'rank': ['Excellent', 'Very good', 'Excellent', 'Good', 'Very good']})

# view DataFrame
df_task

Then you can use the following method.

df_task = df_task.apply(lambda x: pd.factorize(x)[0])

# view updated DataFrame
df_task
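To make the result concrete, here is a self-contained sketch that rebuilds the same small DataFrame, factorizes every column, and also stores a label lookup per column. The names encoded and lookup are my own for this illustration.

```python
import pandas as pd

df_task = pd.DataFrame({'team': ['A', 'B', 'C', 'D', 'E'],
                        'task': ['a', 'b', 'b', 'a', 'c'],
                        'rank': ['Excellent', 'Very good', 'Excellent', 'Good', 'Very good']})

# Factorize each column independently; every column starts again from 0
encoded = df_task.apply(lambda x: pd.factorize(x)[0])
print(encoded['task'].tolist())  # [0, 1, 1, 0, 2]
print(encoded['rank'].tolist())  # [0, 1, 0, 2, 1]

# Keep the labels behind the codes so the encoded values stay interpretable
lookup = {col: list(pd.factorize(df_task[col])[1]) for col in df_task.columns}
print(lookup['rank'])  # ['Excellent', 'Very good', 'Good']
```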

2. Replace Methods

In some cases, we may want the codes to start from 1 rather than 0. For example, you may think it is better to encode gdpr (GDP ranking) from First to Fifth as the numbers 1 to 5 in this article’s dataset. Then you can use the replace method.

(1) One column

df_rep = df.replace(to_replace=['First','Second','Third','Fourth','Fifth'],value=[1,2,3,4,5])

(2) More columns

Suppose we also want to replace the five provinces with numbers, again starting from 1.

# one mapping per column: column name -> {label: number}
numbered = {
    'prov': {'Guangdong': 1, 'Jiangsu': 2, 'Shandong': 3, 'Zhejiang': 4, 'Henan': 5},
    'gdpr': {'First': 1, 'Second': 2, 'Third': 3, 'Fourth': 4, 'Fifth': 5},
}

df_rep2 = df.replace(numbered)
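If you prefer to touch only the named columns and to notice labels that are missing from a mapping, Series.map is a possible alternative to replace. The sketch below is only an illustration on made-up rows (the year column and its values are invented); labels that are not in the mapping become NaN instead of being left as strings.

```python
import pandas as pd

# Toy data for illustration only
df_toy = pd.DataFrame({'prov': ['Guangdong', 'Henan', 'Jiangsu'],
                       'gdpr': ['First', 'Fifth', 'Second'],
                       'year': [2020, 2020, 2020]})

numbered = {
    'prov': {'Guangdong': 1, 'Jiangsu': 2, 'Shandong': 3, 'Zhejiang': 4, 'Henan': 5},
    'gdpr': {'First': 1, 'Second': 2, 'Third': 3, 'Fourth': 4, 'Fifth': 5},
}

df_mapped = df_toy.copy()
for col, mapping in numbered.items():
    # map works on one column at a time; unmapped labels would turn into NaN
    df_mapped[col] = df_mapped[col].map(mapping)

print(df_mapped['prov'].tolist())  # [1, 5, 2]
print(df_mapped['gdpr'].tolist())  # [1, 5, 2]
```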

3. Dummy Methods

The pd.get_dummies() function converts string columns into dummy (indicator) columns, one per category.

(1) One column

dum = pd.get_dummies(df['prov'])

A categorical variable with K categories, or levels, usually enters a regression as a sequence of K-1 dummy variables. Otherwise, it might cause the dummy variable trap, where two or more dummy variables created by one-hot encoding are highly correlated (i.e. multicollinear). Setting drop_first=True drops the first level to avoid this.

dum = pd.get_dummies(df['prov'], prefix='prov',drop_first=True)
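The difference between K and K-1 dummies is easy to see on a tiny example. This is a sketch on a made-up Series; I pass dtype=int only so that the output is 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy data for illustration only
prov = pd.Series(['Guangdong', 'Henan', 'Jiangsu', 'Henan'], name='prov')

# K dummy columns, one per category
full = pd.get_dummies(prov, prefix='prov', dtype=int)
# K-1 dummy columns, the first category becomes the baseline
reduced = pd.get_dummies(prov, prefix='prov', drop_first=True, dtype=int)

print(list(full.columns))     # ['prov_Guangdong', 'prov_Henan', 'prov_Jiangsu']
print(list(reduced.columns))  # ['prov_Henan', 'prov_Jiangsu']

# With all K columns every row sums to 1, which duplicates a regression intercept
print(full.sum(axis=1).tolist())  # [1, 1, 1, 1]
```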

(2) More columns

dums = pd.get_dummies(df[['prov','gdpr']])

The question here is how to add these dummies to the DataFrame.

(3) Add dummies to Pandas DataFrame

df_coded = df.join(dums)

(4) A more compact method

Since pandas version 0.15.0, pd.get_dummies can handle a DataFrame directly, while it could only handle a single Series before that.

df_dum_coded = pd.get_dummies(df)


If you only want dummies for specific columns, you can pass their names to the columns parameter. The other columns are left unchanged.

df_select_dum_coded = pd.get_dummies(data=df, columns=['prov', 'gdpr'])
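The join route and the columns route lead to the same encoded table once the original string columns are dropped. Here is a small sketch on made-up rows (the gdp values are invented for illustration) that compares the two.

```python
import pandas as pd

# Toy data for illustration only
df_toy = pd.DataFrame({'prov': ['Guangdong', 'Henan'],
                       'gdpr': ['First', 'Fifth'],
                       'gdp': [1.0, 2.0]})

# Route 1: build the dummies separately, join them, then drop the source columns
dums = pd.get_dummies(df_toy[['prov', 'gdpr']], dtype=int)
joined = df_toy.join(dums).drop(columns=['prov', 'gdpr'])

# Route 2: let get_dummies encode the named columns in one call
direct = pd.get_dummies(df_toy, columns=['prov', 'gdpr'], dtype=int)

print(list(direct.columns))
# ['gdp', 'prov_Guangdong', 'prov_Henan', 'gdpr_Fifth', 'gdpr_First']
print(joined.equals(direct))  # True
```

Note that df.join(dums) on its own keeps the original prov and gdpr columns, so they still need to be dropped before modelling.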

4. OneHotEncoder

The last method that I show here is the OneHotEncoder class of the scikit-learn library.

(1) Remove the gdpr column

The gdpr column is not helpful for future data analysis because it is just an alternative label for the provinces, so let’s remove it first. If you do want to keep it, you can. I use the original DataFrame here since this is the last example.

df.drop(['gdpr'],axis=1, inplace=True)

The gdpr column has been removed.

(2) Import the class, define the encoder, and fit_transform the string column

from sklearn.preprocessing import OneHotEncoder

# sparse_output (True or False): return a sparse matrix or a dense array
# (scikit-learn versions before 1.2 call this parameter sparse)
# we do not drop the first encoded column here; to drop it, set drop='first'
enc = OneHotEncoder(sparse_output=False)

# fit and transform of the prov column
trans = enc.fit_transform(df[['prov']])

(3) Show the default names of encoded columns

feature_names = enc.get_feature_names_out(['prov'])

We can change the names into abbreviated ones for convenience, especially since they will serve as variable names when we develop models in the future.

feature_names = ['prov_gd','prov_hn', 'prov_js', 'prov_sd', 'prov_zj']

(4) Display them as a Pandas DataFrame

feature_coded = pd.DataFrame(trans, columns=feature_names)

(5) Add it into the original DataFrame

I use pd.concat here to combine the encoded columns with the original DataFrame, then drop the prov column since we have encoded it.

df_coded = pd.concat([df,feature_coded],axis=1)
df_coded.drop(['prov'],axis=1, inplace=True)
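pd.concat with axis=1 lines rows up by index label, so this step relies on the original DataFrame having the default 0 to n-1 index. If rows were filtered earlier, a safer variant is to give the encoded frame the same index. The sketch below uses a made-up two-row frame and a hand-written array as a stand-in for the encoder output.

```python
import numpy as np
import pandas as pd

# Toy frame whose index is NOT 0..n-1, e.g. after filtering rows
df_toy = pd.DataFrame({'prov': ['Guangdong', 'Henan']}, index=[10, 11])
trans = np.array([[1.0, 0.0], [0.0, 1.0]])  # stand-in for the encoder output
cols = ['prov_gd', 'prov_hn']

# Without index=..., the new frame gets index 0..1 and the rows do not line up
misaligned = pd.concat([df_toy, pd.DataFrame(trans, columns=cols)], axis=1)
print(misaligned.shape)  # (4, 3)

# Reusing the original index keeps each encoded row next to its source row
aligned = pd.concat([df_toy, pd.DataFrame(trans, columns=cols, index=df_toy.index)], axis=1)
print(aligned.shape)  # (2, 3)
```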

5. Save the Data

Lastly, let’s save the encoded dataset for future analysis.

df_coded.to_csv('./data/gdp_china_encoded.csv', index=False)

6. Online Course

If you are interested in learning Python data analysis in detail, you are welcome to enroll in one of my courses:

Master Python Data Analysis and Modelling Essentials
