The modelselect package helps you build an optimal linear regression model by removing insignificant variables and solving multicollinearity problems
I developed a small package called modelselect, which can help you quickly develop an optimal linear regression model. This is a brief guide to using the Python modelselect package rather than a tutorial on developing a linear regression model, so the process does not strictly follow the usual modelling workflow. If you are interested in the linear regression modelling process, please read my previous posts.
1. Brief Introduction
(1) What is modelselect?
It is a package that helps us easily create an optimal linear regression model by removing insignificant predictor variables and those that cause multicollinearity. It reduces the interactive, tedious work of running the model, estimating it, evaluating it, re-estimating and re-evaluating it, and so on. You can find this package on PyPI and on its GitHub page.
(2) Install the Package
pip install modelselect
(3) Import the Package
from modelselect import LRSelector
then use LRSelector() directly. Or
import modelselect as ms
then use ms.LRSelector().
(4) Methods
The function takes three parameters: modelselect.LRSelector(X, y, X_drop). A rough sketch of what it plausibly does internally follows the parameter and return lists below.
Parameters:
- X: feature variables, normalized or original
- y: target variable
- X_drop: a list containing the names of the variables to be removed. The default is an empty list, i.e. no variables are dropped
Returns:
- res: OLS estimation results
- vif: Variance Inflation Factor
- X_new: feature variables after removing the dropped variables. When X_drop is left at its default, X_new equals X.
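Since the package's source is not shown in this post, here is a minimal sketch of the kind of logic LRSelector plausibly performs (my reconstruction under assumptions, not the package's actual code): drop the listed columns, fit an OLS model with an intercept, and build a VIF table with statsmodels.
# Hypothetical reconstruction of LRSelector's logic, not the real source
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def lr_selector_sketch(X, y, X_drop=None):
    X_new = X.drop(columns=list(X_drop or []))  # remove the listed variables
    X_const = sm.add_constant(X_new)            # add the intercept term
    res = sm.OLS(y, X_const).fit()              # estimate the model by OLS
    vif = pd.DataFrame({
        'feature': X_const.columns,
        'VIF': [variance_inflation_factor(X_const.values, i)
                for i in range(X_const.shape[1])],
    })
    return res, vif, X_new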
2. Usage Example
(1) Import required packages
Besides this package, we also require numpy, pandas, statsmodels, scikit-learn and normscaler. You are probably familiar with pandas and scikit-learn, but you may not know normscaler. You can find the normscaler package on PyPI and in one of my previous posts.
If you have not installed them yet, you can install them using pip as follows.
pip install numpy pandas statsmodels scikit-learn normscaler
Now, let’s import them as follows.
import numpy as np
import pandas as pd
from statsmodels.tools.eval_measures import meanabs, mse, rmse
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
import modelselect as ms
from normscaler.scaler import DecimalScaler
(2) Read data to Pandas DataFrame
url = 'https://raw.githubusercontent.com/shoukewei/data/main/data-pydm/gdp_china_encoded.csv'
df = pd.read_csv(url,index_col=False)
# display the first five rows
df.head()
(3) Define independent variables (X) and dependent variable (y)
GDP is the target, and the others are the features.
X = df.drop(['gdp'],axis=1)
y = df['gdp']
(4) Split dataset for model training and testing
Split the dataset for model training/estimation and testing/validation.
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.30, random_state=1)
(5) Normalize datasets with the decimal scaling method
We use DecimalScaler (the decimal scaling method) from the normscaler package to normalize X_train and X_test.
X_train_scaled, X_test_scaled = DecimalScaler(X_train,X_test)
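For intuition, decimal scaling divides each column by a power of ten so that the largest absolute value falls below 1. The following is a rough sketch of the idea only (not the normscaler implementation); note that the scaling factor learned from the training set is reused for the test set.
# Rough sketch of decimal scaling, not the normscaler source
import numpy as np

def decimal_scale(train, test):
    j = np.ceil(np.log10(train.abs().max()))  # per-column power of ten
    factor = 10.0 ** j                        # e.g. column max 4231 -> 10000
    return train / factor, test / factor      # same factor for both sets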
(6) Create a linear regression model using the modelselect package
First, we use all the feature variables, i.e. there is no drop list, or X_drop is left at its default.
modelres = ms.LRSelector(X_train_scaled, y_train)
(7) Display the results
(i) display the OLS regression results
res = modelres[0]
res.summary()
The above results show that the model fits well overall, but pop is statistically insignificant at the 0.05 level. There might also be a strong multicollinearity problem, as suggested by the smallest eigenvalue of 1.57e-30.
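For a quick programmatic check, the statsmodels results object also exposes the p-values directly, so you can list the insignificant variables without reading the summary table.
# Optional: list variables whose p-value exceeds 0.05
print(res.pvalues[res.pvalues > 0.05])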
(ii) display the VIF
vif = modelres[1]
vif
The large VIF values further confirm that there is a strong multicollinearity problem. It is widely accepted that a VIF > 10 indicates multicollinearity, though some scholars choose a more conservative threshold of 5 or even 2.5.
(8) Improve the model
First, let’s remove the insignificant variable, pop, and estimate the model again.
X_drop=['pop']
res_drop_pop, vif_drop_pop, X_drop_pop = ms.LRSelector(X_train_scaled, y_train, X_drop=X_drop)
print(res_drop_pop.summary())
print(vif_drop_pop)
The results show that, after dropping pop, all the remaining variables are statistically significant at the 0.05 level. However, the VIF results show that the model still suffers from multicollinearity.
Thus, we need to further drop the variables whose VIF exceeds the threshold. Let’s start by dropping prov_gd, which has an infinite VIF.
X_drop=['pop','prov_gd']
res_drop_gd, vif_drop_gd, X_drop_gd = ms.LRSelector(X_train_scaled, y_train, X_drop=X_drop)
print(res_drop_gd.summary())
print(vif_drop_gd)
You can see that the VIFs of some variables are still large, so we add year to the drop list due to its large VIF. We continue this process until all the VIFs are less than or equal to 10, except the constant. The final results look as follows.
X_drop=['pop','prov_gd','year','fexpen','uinc']
res_drop_final, vif_drop_final, X_drop_final = ms.LRSelector(X_train_scaled, y_train, X_drop=X_drop)
print(res_drop_final.summary())
print(vif_drop_final)
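If you want to automate this trial-and-error loop, a sketch like the following could work. It assumes, as in the hypothetical reconstruction in section 1, that the returned VIF table is a DataFrame with 'feature' and 'VIF' columns; adjust the column names to whatever your installed version actually returns.
# Hedged sketch: repeatedly drop the feature with the highest VIF
# until every VIF (except the constant) is at most 10
drop_list = ['pop']  # start from the insignificant variable
while True:
    res_i, vif_i, X_i = ms.LRSelector(X_train_scaled, y_train, X_drop=drop_list)
    vif_feats = vif_i[vif_i['feature'] != 'const']  # ignore the intercept
    worst = vif_feats.sort_values('VIF').iloc[-1]   # highest-VIF feature
    if worst['VIF'] <= 10:
        break
    drop_list.append(worst['feature'])
print(drop_list)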
(9) Model validation/testing
Let’s test the model using the testing dataset.
X_test_drop = X_test_scaled.drop(['pop','prov_gd','year','fexpen','uinc'],axis=1)
X_test_drop = sm.add_constant(X_test_drop)
y_pred = res_drop_final.predict(X_test_drop)
Then, we calculate the MAE, MSE, RMSE and MAPE on the testing set.
print("mean_absolute_error(MAE): ", meanabs(y_test,y_pred))
print("mean_squared_error(MSE): ", mse(y_test,y_pred))
print("root_mean_squared_error(RMSE): ",rmse(y_test,y_pred))
print ("mean_absolute_percentage_error(MAPE): ",np.mean((abs(y_test-y_pred))/y_test))
mean_absolute_error(MAE): 0.19150624526331034
mean_squared_error(MSE): 0.05467623605841458
root_mean_squared_error(RMSE): 0.23382950211300238
mean_absolute_percentage_error(MAPE): 0.08418013652994877
The testing results, e.g. a MAPE of about 8.4%, show that the model has very good prediction performance.
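As an optional cross-check, scikit-learn (already used above for train_test_split) provides equivalent metric functions; this assumes a reasonably recent version, since mean_absolute_percentage_error was added in scikit-learn 0.24.
# Optional cross-check with scikit-learn's metric functions
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error

print('MAE: ', mean_absolute_error(y_test, y_pred))
print('MSE: ', mean_squared_error(y_test, y_pred))
print('RMSE:', mean_squared_error(y_test, y_pred) ** 0.5)
print('MAPE:', mean_absolute_percentage_error(y_test, y_pred))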
Online course
If you are interested in learning the essentials of Python data analysis and modelling in detail, you are welcome to enroll in one of my courses:
Master Python Data Analysis and Modelling Essentials