Linear Regression
Below is an example of a regression problem that uses linear regression to predict a measure of disease progression from the diabetes dataset bundled with scikit-learn (loaded with load_diabetes). Note that sklearn's LinearRegression is plain ordinary least squares and applies no regularization, so there is nothing to reduce on that front. I also compare the output of statsmodels' linear regression model with sklearn's. I avoid heavy feature engineering here, as this notebook is just an example of the models.
Import Preliminaries¶
%matplotlib inline
%config InlineBackend.figure_format='retina'
# Import modules
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import matplotlib as mpl
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import warnings
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from statsmodels.regression import linear_model
# Import models
from sklearn.linear_model import LinearRegression
# Set pandas options
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 100)
# Set plotting options
mpl.rcParams['figure.figsize'] = (9.0, 6.0)
# Set warning options
warnings.filterwarnings('ignore');
Import Data¶
# Import diabetes data
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
# Conduct a train-test-split on the data
train_x, test_x, train_y, test_y = train_test_split(X,y, test_size=0.25)
# View the training dataframe
pd.DataFrame(train_x, columns=diabetes['feature_names']).head(5)
Target Outcomes¶
# Plot a histogram of the disease progression values
pd.Series(train_y).plot.hist(grid=False, color='#5C3697', edgecolor='w')
plt.title('Distribution of Disease Progression Outcomes')
plt.ylabel('Frequency')
plt.xlabel('Measure of Disease Progression')
plt.axvline(pd.Series(train_y).mean(), color='black', linestyle='dashed', linewidth=2);
Fit the Model¶
# Fit an initial linear model
lr_model = LinearRegression()
lr_model.fit(train_x, train_y);
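As a rough sketch of what fitting produces: the fitted model stores one coefficient per feature in coef_ and a single intercept in intercept_, and a prediction is just their linear combination. The snippet reloads the data so that it runs on its own; the names manual_pred and sk_pred are illustrative and not part of the notebook above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Reload the data so this cell runs on its own
X, y = load_diabetes(return_X_y=True)
model = LinearRegression().fit(X, y)

# One coefficient per feature, plus a single intercept
print(model.coef_.shape)
print(model.intercept_)

# A prediction is the dot product of the features with the coefficients,
# shifted by the intercept
manual_pred = X @ model.coef_ + model.intercept_
sk_pred = model.predict(X)
print(np.allclose(manual_pred, sk_pred))
```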
Model Evaluation¶
Cross Validation Score¶
# Calculate our cross validation score (R^2, averaged over 10 folds)
scores = cross_val_score(lr_model, train_x, train_y, cv=10,
                         scoring='r2')
print(f'Cross Validation Score: {scores.mean():.5f}')
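A minimal sketch of scoring the same kind of model with two regression metrics, R^2 and RMSE, and reporting the spread across folds as well as the mean. It reloads the full dataset so that it runs on its own, and the variable names r2_scores and rmse_scores are assumptions made for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Reload the data so this cell runs on its own
X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

# R^2 per fold (higher is better, 1.0 is a perfect fit)
r2_scores = cross_val_score(model, X, y, cv=10, scoring='r2')

# sklearn negates error metrics so that "higher is better" holds for
# every scorer; flip the sign to recover the RMSE per fold
rmse_scores = -cross_val_score(model, X, y, cv=10,
                               scoring='neg_root_mean_squared_error')

print(f'Mean R^2:  {r2_scores.mean():.5f} (+/- {r2_scores.std():.5f})')
print(f'Mean RMSE: {rmse_scores.mean():.5f} (+/- {rmse_scores.std():.5f})')
```

RMSE is expressed in the same units as the target, which often makes it easier to interpret than R^2 alone.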
Viewing Residuals¶
# Plot residuals from our model against its predictions
pred_y = lr_model.predict(train_x)
sns.residplot(x=pred_y, y=train_y, color='#AD59FF')
plt.title('Linear Model Residuals')
plt.ylabel('Residual')
plt.xlabel('Predicted Value');
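The residuals behind a plot like this can also be computed directly. The sketch below fits on the full dataset purely so that it runs on its own (the notebook fits on the training split), and checks two properties that least squares with an intercept guarantees on the data it was fitted to: the residuals average to zero and are uncorrelated with the predictions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Reload the data so this cell runs on its own
X, y = load_diabetes(return_X_y=True)
model = LinearRegression().fit(X, y)

# A residual is the observed target minus the model's prediction
pred = model.predict(X)
residuals = y - pred

# Both of these should be zero up to floating-point error
print(f'Mean residual: {residuals.mean():.2e}')
print(f'Correlation with predictions: {np.corrcoef(pred, residuals)[0, 1]:.2e}')

# The spread of the residuals is the typical size of a training error
print(f'Residual standard deviation: {residuals.std():.3f}')
```

Any visible trend or funnel shape in the residual plot therefore points to structure the linear model has not captured, not to a nonzero average error.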
Coefficient Magnitude¶
# Plot each coefficient, with an up marker for positive values and a down marker for negative ones
markers = ['^' if coef >= 0 else 'v' for coef in lr_model.coef_]
for i, marker in enumerate(markers):
    plt.plot(i, lr_model.coef_[i], marker=marker, color='#AB56FF', linewidth=0, markersize=12)
plt.xticks(range(diabetes.data.shape[1]), diabetes.feature_names, rotation=90);
plt.title('Coefficient Magnitude Plot')
plt.ylabel('Coefficient Magnitude')
plt.xlabel('Feature')
plt.grid();
Alternative: Statsmodels¶
statsmodels can give us a bit more information about the linear regression than sklearn, such as standard errors, p-values, and confidence intervals for each coefficient. So let's give it a try.
# Add a constant (intercept) column to the training data
train_x = sm.add_constant(train_x)
# Conduct linear regression via the statsmodels API
myregression = linear_model.OLS(train_y, train_x).fit()
myregression.summary()
General Notes¶
-- The simplest and most classic linear model
-- The algorithm minimizes the mean squared error (equivalently, the sum of squared residuals) on the training data
-- You can use either sklearn or statsmodels for this type of problem. statsmodels is a bit more informative
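To illustrate the point about minimizing the mean squared error, here is a small NumPy-only sketch. The data is synthetic and made up for this example, and the helper mse is hypothetical. It solves the least-squares problem with np.linalg.lstsq and then shows that nudging the solution in any direction raises the training error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, invented for illustration only
X = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y = X @ true_coef + 4.0 + rng.normal(scale=0.3, size=200)

# Prepend a column of ones so that the intercept is estimated too
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X_design @ beta ~= y
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

def mse(b):
    """Mean squared error of the linear model with parameters b."""
    return np.mean((y - X_design @ b) ** 2)

best_mse = mse(beta)

# Randomly perturbed copies of the solution all fit the training data worse
perturbed_mse = [mse(beta + rng.normal(scale=0.1, size=beta.shape))
                 for _ in range(5)]

print(f'MSE at the least-squares solution: {best_mse:.4f}')
print(f'Smallest MSE after a perturbation: {min(perturbed_mse):.4f}')
```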
Author: Kavi Sekhon