Display how to apply stepwise regression with a concrete real-world example
Stepwise regression is a very helpful approach to select the statistic significant predictors in a traditional regression model. In a previous post, Stepwise regression has been generally introduced. I suggest you glancing over that post first before read this post if you are not familiar with stepwise regression. In this post, my focus is to introduce a stepwise regression package in Python and display how to use it to a concrete real-world dataset. This package can help you avoid many tedious and iterative works to examine if each of the features is statistical significance during model development.
1. Install the packages
This package is called
stepwise-regression, and you can go to its home page in GitHub to see details and the source code.
First, let’s install it using
pip from PyPI.
pip install stepwise-regression
The last line shows that the stepwise-regression package version 1.0.3 has been successfully installed.
2. Use it for a real-world example
(1) Import the required packages
Now, let’s import the required packages. Besides,
stepwise-regression package, we also need
import pandas as pd
import statsmodels.api as sm
from stepwise_regression import step_reg
(2) Read the data
You can download the data in GitHub through this linkage, and then read the data from your local working directory. But here we read the data directly from the GitHub. If you are not familiar with how to read data from GitHub, you can go to one of my previous posts on this topic.
url = ‘https://raw.githubusercontent.com/Sid-149/Life-Expectancy-Predictor-Comparative-Analysis/main/Notebooks/Life%20Expectancy%20Data.csv'
df = pd.read_csv(url,index_col=False)
(3) Missing data imputation
In a previous post, detailed methods on missing value detection and imputation have been discussed. Here, we just see if there are missing values in data and impute them before using the packages, or there will be an error message if there is missing data.
missings = df.isna().sum().sum()
So there are 2563 missing values in the dataset. Next, we impute them using linear interpolation.
(4) Stepwise regression using the packages
In this example, we will create a model to predict Life expectancy. Then we slice the data into independent variables and dependent variables. We also drop the string/categorical variables from independent variables because the aim of this post is to see how to use the packages to do stepwise regression. If you are looking for methods to encode them and include them in your model, you can read this post.
X = df_impute.drop(['Country','Status','Life expectancy'],axis=1)
y = df_impute['Life expectancy']
(I) Whole model with all predictors
Before using the packages, we create a linear regression model included all the selected variables.
# add a constant
X = sm.add_constant(X)
# define the model and fit it
model = sm.OLS(y, X)
results = model.fit()
To display the OSL results, type:
From p-values, we can see Year, Population, thinness 1–19 years and thinness 5–9 years are statistically insignificant at the significant level of 0.05.
Next, let’s see if we can use the package to help us select the feature variables.
There are two functions, namely
forward_regression. There are four parameters in the functions.
X: the independent variables
y：the dependent variable
threshold_in: the threshold value set for p-value, normally 0.05
verbose: the default is False
(a) Backward selection method
Now, let’s see the backward regression.
backselect = step_reg.backward_regression(X, y, 0.05,verbose=False)
The backward selected predictors including the constant are 17 factors.
Next, let’s create a linear regression to check if the all the predictors are significant at level of 0.05.
# add a constant
X_backselect = sm.add_constant(X_backselect)
# define the model and fit it
backmodel = sm.OLS(y, X_backselect)
backres = backmodel.fit()
The OSL results are as follows:
From the above results, we can see that all p-values are less or equal to 0.05, which indicates all the selected predictors are statistically significant at the level of 0.05.
(b) Forward selection method
Now let’s check the
forward_regression function of this package. The function parameters are the same with its
forwardselect = step_reg.forward_regression(X, y, 0.05,verbose=False)
The forward selected features including the constant are:
Compared with the backward selection method, the predictors selected by forward method has only 14 factors. Let’s create a model based on these 14 factors and examine its results.
Compared with the results of backward stepwise regression and forward stepwise regression, we found that the goodness of fit (R²) of backward method is higher than that of forward one in our example.
Normally, forward and backward stepwise regression would give us the same result, but this is not always the case even we have the same features in the final model.
If you are interested in learning Python data analysis in details, you are welcome to enroll one of my courses: