Cross Validation and K-Fold
Cross-validation scores a model using only the training data set. K-fold cross-validation is a stronger heuristic for evaluating model performance: it repeats the train/validate cycle K times, holding out a different fold of the data each time, and then reports the mean of all K validation scores.
Cross Validation: First 80% train -> Last 20% test
K-Fold Cross Validation (5 folds):
unique 80% train -> unique 20% test -> Score -\
unique 80% train -> unique 20% test -> Score --\
unique 80% train -> unique 20% test -> Score ----} Mean Score
unique 80% train -> unique 20% test -> Score --/
unique 80% train -> unique 20% test -> Score -/
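To make the folding concrete, here is a minimal sketch (on toy data, not the Boston dataset used below) of how scikit-learn's KFold hands out row indices, with five folds as in the diagram:
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 toy samples, 2 features
kfold = KFold(n_splits=5)

# Each pass holds out a different, non-overlapping 20% of the rows as the test fold
for train_index, test_index in kfold.split(X):
    print('train:', train_index, 'test:', test_index)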
You can also shuffle the K-fold splits so that the samples in each validation fold are randomly selected rather than taken in order (for classification problems, a stratified variant additionally preserves the class proportions in every fold). Note that shuffled K-fold cross-validation is not recommended for data with seasonality or other time-series structure, since it lets the model train on the future; see the sketch below.
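For time-ordered data, one standard alternative is scikit-learn's TimeSeriesSplit, which only ever trains on the past and validates on the future. A minimal sketch on toy data:
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(24).reshape(12, 2)   # 12 time-ordered toy samples
tscv = TimeSeriesSplit(n_splits=5)

# The training window grows forward in time; the test fold always comes after it
for train_index, test_index in tscv.split(X):
    print('train:', train_index, 'test:', test_index)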
Import Preliminaries
# Import modules
import numpy as np
from sklearn.datasets import load_boston  # removed in scikit-learn >= 1.2; requires an older version
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
# Import data
boston = load_boston()
X, y = boston.data, boston.target
Fit Model
# Fit Regression model
regression = LinearRegression()
regression.fit(X, y)
Training Score
# R-squared of the fitted model on the same data it was trained on (not a held-out score)
regression.score(X, y)
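The single split diagrammed at the top (first 80% train, last 20% test) can be sketched with scikit-learn's train_test_split. Passing shuffle=False to preserve row order is an assumption made to match the diagram; it is not part of the original code:
# Hold out the last 20% of rows as a test set (shuffle=False keeps row order)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    shuffle=False)
holdout_model = LinearRegression()
holdout_model.fit(X_train, y_train)

# R-squared on the held-out 20%
holdout_model.score(X_test, y_test)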
K-Fold Cross Validation Score
# random_state only applies when shuffle=True (recent scikit-learn raises an error otherwise)
crossvalidation = KFold(n_splits=20, shuffle=False)
scores = cross_val_score(regression, X, y,
                         scoring='neg_mean_squared_error',
                         cv=crossvalidation, n_jobs=1)
print('Folds: %i, mean squared error: %.2f std: %.2f' % (len(scores),
                                                          np.mean(np.abs(scores)),
                                                          np.std(scores)))
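The scores returned here are negative MSE values: scikit-learn negates error metrics so that a higher score is always better, which is why the print statement wraps them in np.abs. To report the error in the target's own units, you can take the square root of the mean fold MSE:
# Root of the mean fold MSE, in the same units as the target
rmse = np.sqrt(np.mean(np.abs(scores)))
print('RMSE: %.2f' % rmse)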
Shuffled K-Fold Cross Validation Score
crossvalidation = KFold(n_splits=20, random_state=1, shuffle=True)
scores = cross_val_score(regression, X, y,
                         scoring='neg_mean_squared_error',
                         cv=crossvalidation, n_jobs=1)
print('Folds: %i, mean squared error: %.2f std: %.2f' % (len(scores),
                                                          np.mean(np.abs(scores)),
                                                          np.std(scores)))
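Stratification proper applies to classification rather than regression: StratifiedKFold builds each fold so it keeps the full dataset's class proportions. A minimal sketch, assuming the iris dataset and a logistic-regression classifier (neither appears in the code above):
# Stratified K-Fold on a classification problem (assumed iris data)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X_iris, y_iris = load_iris(return_X_y=True)
classifier = LogisticRegression(max_iter=1000)

# Each fold mirrors the class balance of the full dataset
stratified = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(classifier, X_iris, y_iris, cv=stratified)
print('Folds: %i, mean accuracy: %.2f' % (len(scores), np.mean(scores)))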