10 Sep

Linear Regression

Below is an example of a regression problem using linear regression to predict disease progression from the diabetes dataset bundled with sklearn. Note that sklearn's LinearRegression applies no regularization, so it serves as a plain ordinary-least-squares baseline. I also compare and contrast the results of statsmodels' linear regression model and sklearn's linear regression model. I try to avoid a lot of feature engineering here, as this notebook is just an example of the models.

Import Preliminaries

In [2]:
%matplotlib inline
%config InlineBackend.figure_format='retina'

# Import modules
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import matplotlib as mpl
import numpy as np
import pandas as pd 
import seaborn as sns
import statsmodels.api as sm
import warnings

from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from statsmodels.regression import linear_model

# Import models
from sklearn.linear_model import LinearRegression


# Set pandas options
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 100)

# Set plotting options
mpl.rcParams['figure.figsize'] = (9.0, 6.0)

# Set warning options
warnings.filterwarnings('ignore');

Import Data

In [3]:
# Import the diabetes data
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# Conduct a train-test-split on the data
train_x, test_x, train_y, test_y = train_test_split(X,y, test_size=0.25)

# View the training dataframe
pd.DataFrame(train_x, columns=diabetes['feature_names']).head(5)
Out[3]:
age sex bmi bp s1 s2 s3 s4 s5 s6
0 0.027178 -0.044642 0.049840 -0.055018 -0.002945 0.040648 -0.058127 0.052759 -0.052959 -0.005220
1 0.038076 0.050680 -0.024529 0.004658 -0.026336 -0.026366 0.015505 -0.039493 -0.015998 -0.025930
2 -0.060003 0.050680 0.049840 0.018429 -0.016704 -0.030124 -0.017629 -0.002592 0.049769 -0.059067
3 0.027178 0.050680 0.017506 -0.033214 -0.007073 0.045972 -0.065491 0.071210 -0.096433 -0.059067
4 0.041708 0.050680 -0.015906 0.017282 -0.037344 -0.013840 -0.024993 -0.011080 -0.046879 0.015491

Target Outcomes

In [4]:
# Plot a histogram of all the target outcome values
pd.Series(train_y).plot.hist(grid=False, color='#5C3697', edgecolor='w')
plt.title('Distribution of Disease Progression Outcomes')
plt.ylabel('Frequency')
plt.xlabel('Measure of Disease Progression')
plt.axvline(pd.Series(train_y).mean(), color='black', linestyle='dashed', linewidth=2);

Fit the Model

In [5]:
# Fit an initial linear model
lr_model = LinearRegression()
lr_model.fit(train_x, train_y);

Model Evaluation

Cross Validation Score
In [6]:
# Calculate our cross-validation score (R^2 is the default scorer for regression)
scores = cross_val_score(lr_model, train_x, train_y, cv=10)
print(f'Cross Validation Score: {scores.mean():.5f}')
Cross Validation Score: 0.49917
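
Since we held out a test split earlier, a quick follow-up check (a sketch, assuming the cells above have been run) is to score the fitted model on that split; LinearRegression's score method reports R^2 on the data it is given.

# Evaluate the fitted model on the held-out test split (R^2)
test_score = lr_model.score(test_x, test_y)
print(f'Test Set R^2: {test_score:.5f}')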

Viewing Residuals

In [7]:
# Plotting residuals from our model
pred_y = lr_model.predict(train_x)
sns.residplot(x=pred_y, y=train_y, color='#AD59FF')
plt.title('Linear Model Residuals')
plt.ylabel('Residual')
plt.xlabel('Predicted Value');
Coefficient Magnitude
In [8]:
# Plot our coefficient magnitudes with varying markers
markers = ['^' if c >= 0 else 'v' for c in lr_model.coef_]
for i, marker in enumerate(markers):
    plt.plot(i, lr_model.coef_[i], marker=marker, color='#AB56FF',
             linewidth=0, markersize=12)
plt.xticks(range(diabetes.data.shape[1]), diabetes.feature_names, rotation=90)
plt.title('Coefficient Magnitude Plot')
plt.ylabel('Coefficient Magnitude')
plt.xlabel('Feature')
plt.grid();

Alternative: Statsmodels

Using statsmodels can give us a bit more information about the linear regression than sklearn, so let's give it a try.

In [9]:
# Add a constant to the training data
train_x = sm.add_constant(train_x)

# Conduct linear regression via stats model's API
myregression = linear_model.OLS(train_y, train_x).fit()
myregression.summary()
Out[9]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.546
Model:                            OLS   Adj. R-squared:                  0.531
Method:                 Least Squares   F-statistic:                     38.41
Date:                Sun, 16 Sep 2018   Prob (F-statistic):           4.21e-49
Time:                        10:43:58   Log-Likelihood:                -1784.3
No. Observations:                 331   AIC:                             3591.
Df Residuals:                     320   BIC:                             3632.
Df Model:                          10
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        152.8390      2.976     51.355      0.000     146.984     158.694
x1            24.0935     69.715      0.346      0.730    -113.064     161.251
x2          -277.7231     70.296     -3.951      0.000    -416.024    -139.422
x3           575.2120     75.563      7.612      0.000     426.550     723.874
x4           289.5050     73.612      3.933      0.000     144.680     434.330
x5          -819.4125    460.955     -1.778      0.076   -1726.298      87.473
x6           487.2440    375.748      1.297      0.196    -252.005    1226.493
x7            44.6173    231.245      0.193      0.847    -410.335     499.570
x8            57.2229    178.351      0.321      0.749    -293.666     408.112
x9           777.5235    190.691      4.077      0.000     402.357    1152.690
x10           67.5685     74.858      0.903      0.367     -79.707     214.844
==============================================================================
Omnibus:                        1.399   Durbin-Watson:                   1.929
Prob(Omnibus):                  0.497   Jarque-Bera (JB):                1.299
Skew:                           0.002   Prob(JB):                        0.522
Kurtosis:                       2.693   Cond. No.                         218.
==============================================================================


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
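
As a quick cross-check (again assuming the cells above have been run), the unregularized sklearn fit should reproduce the statsmodels estimates: the intercept corresponds to const and the coefficients to x1 through x10 in the summary above.

# Compare sklearn's fitted parameters with the statsmodels estimates
print(f'sklearn intercept: {lr_model.intercept_:.4f}')       # compare with const
print('sklearn coefficients:', np.round(lr_model.coef_, 4))  # compare with x1..x10
print('statsmodels params:', np.round(myregression.params, 4))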

General Notes

-- The simplest and most classic linear model
-- The algorithm minimizes the mean squared error (see the sketch below)
-- You can use either sklearn or statsmodels for this type of problem; statsmodels is a bit more informative
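
As a minimal sketch of that objective (not part of the notebook's original code), the ordinary-least-squares coefficients can also be computed directly with numpy's lstsq, which solves for the weights that minimize the mean squared error on the same diabetes data:

import numpy as np
from sklearn.datasets import load_diabetes

# Least-squares fit computed directly: minimizes ||Xb - y||^2, i.e. the MSE
X, y = load_diabetes(return_X_y=True)
X_design = np.column_stack([np.ones(len(X)), X])      # prepend an intercept column
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)   # ordinary least-squares solution
mse = np.mean((X_design @ beta - y) ** 2)             # mean squared error at the optimum
print('Intercept:', round(beta[0], 4))
print('MSE:', round(mse, 4))

The intercept and coefficients recovered this way should match what LinearRegression and OLS report above, since all three solve the same unregularized least-squares problem.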

Author: Kavi Sekhon