Gradient Boosting Classifier
Gradient Boosting Trees follow the same type of logic as Random Forest Classifiers: instead of using one tree, let's build an ensemble. The key difference here is that instead of building all the trees independently of one another, we build them out sequentially, where each new tree tries to correct the errors of the previous one.
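To make the sequential idea concrete, here is a minimal, regression-flavoured sketch of the boosting loop. It is illustrative only: the real GradientBoostingClassifier fits each tree to gradients of the log loss rather than raw residuals, and boosted_fit is a hypothetical name, not part of scikit-learn.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_fit(X, y, n_trees=100, learning_rate=0.1):
    # Start from a zero prediction and add one shallow tree at a time
    trees, prediction = [], np.zeros(len(y), dtype=float)
    for _ in range(n_trees):
        residual = y - prediction  # the errors the ensemble still makes
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        prediction += learning_rate * tree.predict(X)  # damped correction
        trees.append(tree)
    return trees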
In this notebook, I will be looking at the famous breast cancer dataset. This dataset is a binary classification problem, where I need to predict whether each observation is malignant or benign given its features. I often reuse this dataset between my tree-based notebooks. Using the same dataset makes it very easy to compare and contrast the performance of different tree-based models, and keeps the trees a reasonable size.
Dataset: Breast Cancer
Import Preliminaries¶
%matplotlib inline
%config InlineBackend.figure_format='retina'
# Import modules
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
import seaborn
import warnings
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
# Set pandas options
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 30)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# Set plotting options
mpl.rcParams['figure.figsize'] = (9.0, 3.0)
# Set warning options
warnings.filterwarnings('ignore');
Import Data¶
# Import Breast Cancer data
breast_cancer = load_breast_cancer()
X, y = breast_cancer.data, breast_cancer.target
# Conduct a train-test split on the data
train_x, test_x, train_y, test_y = train_test_split(X,y)
# View the training dataframe
pd.DataFrame(train_x, columns=breast_cancer['feature_names']).head(5)
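One caveat: the split above is unseeded, so the exact scores below will vary from run to run. If reproducibility matters, the split can be seeded; the random_state value here is an arbitrary illustrative choice, not one used elsewhere in this notebook.
# Optional: a seeded split makes the notebook's numbers reproducible
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=42)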
Data Overview¶
# Plot a barplot of the target classes
pd.Series(train_y).value_counts().plot.barh(grid=False, color=['#B2E2E2','#66C2A4'], width=0.25, edgecolor='w')
plt.title('Target Outcomes')
plt.ylabel('Class')
plt.xlabel('Count');
Fit the Model¶
# Fit the initial model
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
gb_model.fit(train_x, train_y);
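Because the trees are built sequentially, scikit-learn exposes the intermediate ensembles. A quick way to watch the sequential correction at work is staged_predict, which yields the ensemble's predictions after each additional tree:
from sklearn.metrics import accuracy_score

# Training accuracy after each successive tree is added to the ensemble
staged_acc = [accuracy_score(train_y, pred)
              for pred in gb_model.staged_predict(train_x)]
print(f'Accuracy after 1 tree: {staged_acc[0]:.5f}')
print(f'Accuracy after {len(staged_acc)} trees: {staged_acc[-1]:.5f}')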
Model Evaluation¶
Cross Validation Score¶
# View the cross validation score of the initial model
scores = cross_val_score(gb_model, train_x, train_y, cv=10,
scoring='accuracy')
print(f'Cross Validation Score: {scores.mean():.5f}')
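The mean alone can hide fold-to-fold variance, so it is worth glancing at the spread of the individual fold scores as well:
# View the individual fold scores and their spread
print(f'Fold Scores: {np.round(scores, 3)}')
print(f'Standard Deviation: {scores.std():.5f}')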
Confusion Matrix¶
# Training confusion matrix
from sklearn.metrics import confusion_matrix
cmatrix = pd.DataFrame(confusion_matrix(train_y, gb_model.predict(train_x)))
cmatrix.index.name = 'class'
cmatrix['result'] = 'actual'
cmatrix.set_index('result', append=True, inplace=True)
cmatrix = cmatrix.reorder_levels(['result', 'class'])
cmatrix = cmatrix.stack()
cmatrix = pd.DataFrame(cmatrix)
cmatrix.columns = ['prediction']
cmatrix.unstack()
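For per-class precision and recall alongside the raw counts above, scikit-learn's classification_report is a convenient complement; target_names maps the 0/1 labels back to malignant/benign:
from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 on the training data
print(classification_report(train_y, gb_model.predict(train_x),
                            target_names=breast_cancer.target_names))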
Feature Importance¶
# Plot ensemble's feature importance
mpl.rcParams['figure.figsize'] = (9.0, 6.0)
n_features = breast_cancer.data.shape[1]
plt.barh(range(n_features), gb_model.feature_importances_, align='center', color='#4D977E')
plt.yticks(np.arange(n_features), breast_cancer.feature_names)
plt.title('Gradient Boosting Tree Feature Importance')
plt.xlabel("Feature importance")
plt.ylabel("Features")
plt.ylim(-1, n_features);
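The bar chart is easier to read when sorted; a small sketch that ranks the same importances numerically:
# Rank the features by importance and view the top five
importances = pd.Series(gb_model.feature_importances_,
                        index=breast_cancer.feature_names)
importances.sort_values(ascending=False).head(5)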
Parameter Tuning¶
# Define parameter range and score lists
n_estimators_range = np.arange(0,300, 25)[1:]
train_score = []
test_score = []
# Train a gradient boosting model for each value in our estimator range
for i in n_estimators_range:
    gb_model = GradientBoostingClassifier(n_estimators=i).fit(train_x, train_y)
    train_score.append(cross_val_score(gb_model, train_x, train_y, cv=10, scoring='accuracy').mean())
    test_score.append(cross_val_score(gb_model, test_x, test_y, cv=10, scoring='accuracy').mean())
# Plot our results
mpl.rcParams['figure.figsize'] = (9.0, 6.0)
plt.plot(n_estimators_range,train_score,label="Train",linewidth=2, color='#66C2A4')
plt.plot(n_estimators_range,test_score,label="Test", linewidth=2,linestyle='--', color='#B2E2E2')
plt.legend()
plt.title('Gradient Boosting Model')
plt.xlabel('Number of Estimators')
plt.ylabel('Accuracy');
Grid Search Tuning¶
# Set up parameter grid
grid = {'learning_rate':np.arange(0.1,1.0,0.2),
'n_estimators':np.arange(0,100, 25)[1:],
'max_depth':list(range(2,30,4)),
'max_features': list(range(2,30,2)),
'max_leaf_nodes':[5,10,25,50,75,100]}
# Conduct grid search
grid_search = GridSearchCV(estimator=gb_model, param_grid=grid,
scoring='accuracy', n_jobs=-1, refit=True, cv=10,
return_train_score=True)
# Fit model
grid_search.fit(train_x,train_y);
print('Accuracy of best parameters: %.5f'%grid_search.best_score_)
print('Best parameters: %s' %grid_search.best_params_)
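Beyond the single best combination, the full search results are available in cv_results_, which is handy for spotting runner-up parameter sets:
# Inspect the top parameter combinations from the grid search
results = pd.DataFrame(grid_search.cv_results_)
results[['params', 'mean_test_score', 'std_test_score']].sort_values(
    'mean_test_score', ascending=False).head(5)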
Final Model¶
# Fit the final model
gb_model = GradientBoostingClassifier(learning_rate=0.5,
                                      max_depth=2, max_features=10,
                                      max_leaf_nodes=50, n_estimators=50)
gb_model.fit(train_x, train_y)
# View the cross validation score of the final model
scores = cross_val_score(gb_model, train_x, train_y, cv=10,
scoring='accuracy')
print(f'Cross Validation Score: {scores.mean():.5f}')
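It is also worth scoring the final model once against the held-out test set:
# Score the final model on the held-out test set
print(f'Test Accuracy: {gb_model.score(test_x, test_y):.5f}')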
Confusion Matrix¶
# Training confusion matrix
from sklearn.metrics import confusion_matrix
cmatrix = pd.DataFrame(confusion_matrix(train_y, gb_model.predict(train_x)))
cmatrix.index.name = 'class'
cmatrix['result'] = 'actual'
cmatrix.set_index('result', append=True, inplace=True)
cmatrix = cmatrix.reorder_levels(['result', 'class'])
cmatrix = cmatrix.stack()
cmatrix = pd.DataFrame(cmatrix)
cmatrix.columns = ['prediction']
cmatrix.unstack()
Feature Importance¶
# Plot ensemble's feature importance
n_features = breast_cancer.data.shape[1]
plt.barh(range(n_features),gb_model.feature_importances_, align='center', color='#4D977E')
plt.yticks(np.arange(n_features), breast_cancer.feature_names)
plt.title('Gradient Boosting Tree Feature Importance')
plt.xlabel("Feature importance")
plt.ylabel("Features")
plt.ylim(-1, n_features);
Predict Results¶
# Predict the results from our test data
pd.Series(gb_model.predict(test_x)).head(n=7)
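Hard class labels discard the model's confidence; predict_proba exposes the underlying class probabilities for the same observations:
# View predicted class probabilities for the first few test observations
pd.DataFrame(gb_model.predict_proba(test_x),
             columns=breast_cancer.target_names).head(n=7)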
General Notes¶
-- Gradient boosting trees often use shallow trees (weak learners)
-- Each tree provides good predictions on part of the data, so adding more trees iteratively improves performance
-- A higher learning rate means each tree can make stronger corrections, allowing for more complex models (see the sketch after this list)
-- To apply stronger pre-pruning, lower the learning rate or reduce the maximum depth
-- Moving to Gradient Boosting Trees often lets you squeeze out the last few percentage points of accuracy
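As a quick illustration of the learning-rate note above, a hedged sketch comparing a few rates at a fixed ensemble size (the specific values are arbitrary choices, not tuned):
# Compare train/test accuracy across a few learning rates
for lr in [1.0, 0.1, 0.01]:
    model = GradientBoostingClassifier(n_estimators=100, learning_rate=lr)
    model.fit(train_x, train_y)
    print(f'learning_rate={lr:<4} '
          f'train={model.score(train_x, train_y):.3f} '
          f'test={model.score(test_x, test_y):.3f}')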
Author: Kavi Sekhon