10 Sep

## Linear Regression

Below is an example of a regression problem using linear regression to predict disease progression from the diabetes dataset bundled with sklearn. Note that sklearn's `LinearRegression` applies no explicit regularization, so it fits a plain ordinary-least-squares model. I also compare and contrast the performance of statsmodels' linear regression model and sklearn's. I avoid doing much feature engineering here, as this notebook is just an example of the models.

### Import Preliminaries

In [2]:
%matplotlib inline
%config InlineBackend.figure_format='retina'

# Import modules
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import matplotlib as mpl
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import warnings

from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from statsmodels.regression import linear_model

# Import models
from sklearn.linear_model import LinearRegression

# Set pandas options
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 100)

# Set plotting options
mpl.rcParams['figure.figsize'] = (9.0, 6.0)

# Set warning options
warnings.filterwarnings('ignore');


### Import Data

In [3]:
# Import the diabetes dataset
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# Conduct a train-test split on the data
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.25)

# View the training dataframe
pd.DataFrame(train_x, columns=diabetes.feature_names).head()

Out[3]:
|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|-----|-----|-----|----|----|----|----|----|----|----|
| 0 | 0.027178 | -0.044642 | 0.049840 | -0.055018 | -0.002945 | 0.040648 | -0.058127 | 0.052759 | -0.052959 | -0.005220 |
| 1 | 0.038076 | 0.050680 | -0.024529 | 0.004658 | -0.026336 | -0.026366 | 0.015505 | -0.039493 | -0.015998 | -0.025930 |
| 2 | -0.060003 | 0.050680 | 0.049840 | 0.018429 | -0.016704 | -0.030124 | -0.017629 | -0.002592 | 0.049769 | -0.059067 |
| 3 | 0.027178 | 0.050680 | 0.017506 | -0.033214 | -0.007073 | 0.045972 | -0.065491 | 0.071210 | -0.096433 | -0.059067 |
| 4 | 0.041708 | 0.050680 | -0.015906 | 0.017282 | -0.037344 | -0.013840 | -0.024993 | -0.011080 | -0.046879 | 0.015491 |
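For reference, the split above can be reproduced end to end. This is a minimal sketch, loading the diabetes data directly from sklearn; the fixed `random_state` is my addition (the notebook leaves it unset) so the 75/25 split is repeatable:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Load the bundled diabetes dataset (442 records, 10 standardized features)
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# 75/25 split; random_state pins the shuffle for reproducibility
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(train_x.shape, test_x.shape)  # (331, 10) (111, 10)
```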

### Target Outcomes

In [4]:
# Plot a histogram of all the target values
pd.Series(train_y).plot.hist(grid=False, color='#5C3697', edgecolor='w')
plt.title('Distribution of Disease Progression Outcomes')
plt.ylabel('Frequency')
plt.xlabel('Measure of Disease Progression')
plt.axvline(pd.Series(train_y).mean(), color='black', linestyle='dashed', linewidth=2);


### Fit the Model

In [5]:
# Fit an initial linear model
lr_model = LinearRegression()
lr_model.fit(train_x, train_y);


### Model Evaluation

#### Cross Validation Score
In [6]:
# Calculate our cross-validation score (R^2, the default for regressors)
scores = cross_val_score(lr_model, train_x, train_y, cv=10,
                         scoring='r2')
print(f'Cross Validation Score: {scores.mean():.5f}')

Cross Validation Score: 0.49917
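For a regressor, `cross_val_score` scores with R² by default; mean squared error is exposed as a negated score so that larger is always better. A minimal sketch on the full diabetes data (loaded fresh so the cell is self-contained; the variable names are mine):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

lr_model = LinearRegression()

# Default scorer for a regressor is R^2; naming it makes that explicit
r2_scores = cross_val_score(lr_model, X, y, cv=10, scoring='r2')

# MSE is negated so that all scorers follow "greater is better"
neg_mse = cross_val_score(lr_model, X, y, cv=10,
                          scoring='neg_mean_squared_error')

print(f'Mean R^2: {r2_scores.mean():.5f}')
print(f'Mean MSE: {-neg_mse.mean():.1f}')
```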


#### Viewing Residuals

In [7]:
# Plot residuals from our model
pred_y = lr_model.predict(train_x)
plt.plot(train_y - pred_y, marker='o', linewidth=0, color='#5C3697')
plt.title('Linear Model Residuals')
plt.ylabel('Prediction Error')
plt.xlabel('Record');

#### Coefficient Magnitude
In [8]:
# Plot our coefficient magnitudes with varying markers
markers = ['^' if c >= 0 else 'v' for c in lr_model.coef_]
for i, marker in enumerate(markers):
    plt.plot(i, lr_model.coef_[i], marker=marker, color='#AB56FF',
             linewidth=0, markersize=12)
plt.xticks(range(diabetes.data.shape[1]), diabetes.feature_names, rotation=90)
plt.title('Coefficient Magnitude Plot')
plt.ylabel('Coefficient Magnitude')
plt.xlabel('Feature')
plt.grid();


### Alternative: Statsmodels

Using statsmodels can give us a bit more information about the linear regression than sklearn. So let's give it a try.

In [9]:
# Add a constant to the training data
train_x_const = sm.add_constant(train_x)

# Conduct linear regression via statsmodels' API
myregression = linear_model.OLS(train_y, train_x_const).fit()
myregression.summary()

Out[9]:

| Dep. Variable: | y | R-squared: | 0.546 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.531 |
| Method: | Least Squares | F-statistic: | 38.41 |
| Date: | Sun, 16 Sep 2018 | Prob (F-statistic): | 4.21e-49 |
| Time: | 10:43:58 | Log-Likelihood: | -1784.3 |
| No. Observations: | 331 | AIC: | 3591. |
| Df Residuals: | 320 | BIC: | 3632. |
| Df Model: | 10 | Covariance Type: | nonrobust |

|   | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|------|---------|---|--------|--------|--------|
| const | 152.8390 | 2.976 | 51.355 | 0.000 | 146.984 | 158.694 |
| age | 24.0935 | 69.715 | 0.346 | 0.730 | -113.064 | 161.251 |
| sex | -277.7231 | 70.296 | -3.951 | 0.000 | -416.024 | -139.422 |
| bmi | 575.2120 | 75.563 | 7.612 | 0.000 | 426.550 | 723.874 |
| bp | 289.5050 | 73.612 | 3.933 | 0.000 | 144.680 | 434.330 |
| s1 | -819.4125 | 460.955 | -1.778 | 0.076 | -1726.298 | 87.473 |
| s2 | 487.2440 | 375.748 | 1.297 | 0.196 | -252.005 | 1226.493 |
| s3 | 44.6173 | 231.245 | 0.193 | 0.847 | -410.335 | 499.570 |
| s4 | 57.2229 | 178.351 | 0.321 | 0.749 | -293.666 | 408.112 |
| s5 | 777.5235 | 190.691 | 4.077 | 0.000 | 402.357 | 1152.690 |
| s6 | 67.5685 | 74.858 | 0.903 | 0.367 | -79.707 | 214.844 |

| Omnibus: | 1.399 | Durbin-Watson: | 1.929 |
|---|---|---|---|
| Prob(Omnibus): | 0.497 | Jarque-Bera (JB): | 1.299 |
| Skew: | 0.002 | Prob(JB): | 0.522 |
| Kurtosis: | 2.693 | Cond. No. | 218 |

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

### General Notes

- The simplest and most classic linear model
- The algorithm minimizes the mean squared error
- You can use either sklearn or statsmodels for this type of problem; statsmodels is a bit more informative
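The "minimizes the mean squared error" note has a closed form: with an intercept column appended, OLS solves the least-squares problem whose solution satisfies the normal equations (XᵀX)β = Xᵀy. A sketch verifying that the hand-rolled solution matches sklearn (the diabetes data and the name `beta` are my choices):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Append a column of ones so the first coefficient is the intercept
X1 = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X1 @ beta ~= y (minimizes the sum of squared errors)
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Matches sklearn's fit
lr = LinearRegression().fit(X, y)
assert np.allclose(beta[0], lr.intercept_)
assert np.allclose(beta[1:], lr.coef_)
```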

Author: Kavi Sekhon