How to Improve your Multi-Factor Model
How can you improve your multi-factor model? If you are comfortable that your feature selection explains the target, can you do more? An often-overlooked step is evaluating the features themselves and checking whether any of them are redundant.
Below is a demonstration using Support Vector Regression (SVR), a contemporary model that improves on the classical Multiple Linear Regression (MLR), with the California Housing dataset. Among the advantages of SVR over MLR are its tolerance of outliers and its ability to handle subsets of data that are not linearly separable. A good primer can be found here:
https://medium.com/coinmonks/support-vector-regression-or-svr-8eb3acf6d0ff
We first load the native dataset from the sklearn libraries, followed by exploratory data analysis . . . then split, train, and test with all the features and score the goodness of fit.
Thereafter, a feature assessment using the Variance Inflation Factor (VIF) will give us reason to remove any redundant features, and we repeat the process to improve our score.
Importing the Dataset
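A minimal sketch of this step, assuming scikit-learn's bundled loader:

```python
# Load the California Housing dataset that ships with scikit-learn.
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
print(housing.data.shape)     # (20640, 8): 20,640 observations, 8 features
print(housing.feature_names)  # MedInc, HouseAge, AveRooms, AveBedrms, ...
```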
Exploratory Data Analysis
Assign the target variable, the price of housing, as y, the feature dataset as X, and name the features for ease of identification.
A description of the California housing dataset can be found here:
http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
There are 20,640 observations with 8 features to drive our explanation of the target.
Assign the variables for analysis.
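A sketch of the assignment, assuming the housing object loaded above and pandas for the named columns:

```python
import pandas as pd

# X holds the 8 named features; y is the target, the price of housing.
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target
```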
If we look at the descriptive statistics, then there is good variability across all the features.
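One quick way to produce those statistics, a sketch using the DataFrame X from above:

```python
# Count, mean, standard deviation, min, quartiles, and max for each feature.
print(X.describe())
```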
Let’s see how related the features are to each other . . .
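One way to draw the correlation heatmap; the choice of seaborn and the color map here is an assumption:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix heatmap; the 0-7 labels follow the feature order in X.
sns.heatmap(X.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```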
The movement of Latitude, labeled 6, is opposite that of Longitude, labeled 7 . . . and there is tandem movement in 2 and 3, AveRooms and AveBedrms.
Common knowledge can accept that Latitude and Longitude move in opposition here and that the number of rooms and bedrooms are related in some fashion.
The rest of the features show little association, as their correlation values are near zero.
We will see further below . . . how this will play a role in improving our model score.
The Multi-Factor Model: Support Vector Regression (SVR)
The advantage of SVR is the splitting and optimizing of feature data using the algebra of Lagrange multipliers; the original paper can be found here:
http://image.diku.dk/imagecanon/material/cortes_vapnik95.pdf
SVR is an extension of the Support Vector Machine (SVM) framework.
Let’s run our first model and get our first score . . .
Split the dataset into two for training and testing with a target value, y, and a set of predictor values, X.
y = California Housing Price
X = All other features
Implement by fitting and scoring the SVR model with our dataset of 8 features.
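A sketch of the fit-and-score step; the 80/20 split, the random seed, and the scaling step are assumptions (SVR is sensitive to feature scale, so a StandardScaler is included in the pipeline):

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hold out a test set, then fit the SVR on all 8 features and score it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = make_pipeline(StandardScaler(), SVR())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # goodness of fit on the held-out data
```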
Just like the MLR model, the SVR model has a goodness-of-fit measure, the score. We have a good score of 0.75, so not bad.
(By the way, a 1.00 score would indicate overfitting or the unnecessary capture of statistical noise that would fog our target prediction.)
To elaborate on the score, it is the percentage of the variance in the target variable that is explained by the set of 8 features collectively . . . or how much the features explain the target price of housing in California.
Improving the Multi-Factor Model with a Check on Multicollinearity
Multicollinearity, or collinearity, occurs when two or more features trying to explain the target variable contain redundant information because they correlate with each other, and therefore inflate the variance.
Removal from the feature dataset is one solution, and below we identify which of the original 8 features to remove.
The Variance Inflation Factor (VIF) is a simple and quick check on multicollinearity. It is the reciprocal of the score’s complement, VIF_i = 1 / (1 - R_i²), where R_i² is the score obtained by regressing feature i on all the other features.
In absolute terms, there is no concern between 1 and 4 . . . with some relative concern beginning between 5 and 10.
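A quick way to compute the VIF table, assuming statsmodels:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Add a constant so the intercept does not inflate the per-feature VIFs,
# then compute the VIF for each of the 8 features (the constant is skipped).
X_const = sm.add_constant(X)
for i, name in enumerate(X.columns, start=1):
    print(name, variance_inflation_factor(X_const.values, i))
```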
If we look at the VIF results, then 1, 4 and 5 (HouseAge, Population and AveOccup) show no worrying collinearity, as their values are less than 10. Greater than 10, we have AveRooms and AveBedrms (2 and 3) with some correlation . . . but there is obvious multicollinearity in the last two features, Latitude and Longitude.
The bigger picture of the VIF results is their consistency with our Correlation Matrix Heatmap, since Latitude and Longitude have a strong, not weak, negative correlation of -0.92.
Let’s drop 6 and 7, Latitude and Longitude, as they are paired features, and re-run SVR with 6 features this time to target the price of housing in California.
Reassign the X variable as X2.
No need to change the target variable y, the California Housing Price.
Implement again by fitting and scoring the SVR model with the new dataset X2 and its 6 features instead of the original 8.
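A sketch of the re-run, reusing the same split settings as the first model so the two scores are comparable:

```python
# Drop the collinear pair and refit the SVR on the remaining 6 features.
X2 = X.drop(columns=["Latitude", "Longitude"])

X2_train, X2_test, y_train, y_test = train_test_split(
    X2, y, test_size=0.2, random_state=42)

model2 = make_pipeline(StandardScaler(), SVR())
model2.fit(X2_train, y_train)
print(model2.score(X2_test, y_test))  # compare with the 8-feature score
```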
We have an improvement in the goodness of fit of about 8 points . . . from 0.75 to 0.83 in targeting the price of California housing with 6 features instead of 8.