02 Aug

Downsampling

In supervised learning, many datasets are class imbalanced: one class has far more observations than the other. One simple remedy is to randomly downsample the majority class so that it has as many observations as the minority class. Below I walk through the steps and then wrap them in a function that can downsample NumPy data.

Import Preliminaries

In [2]:
# Import modules
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

Import Data

In [3]:
# Import data
iris = load_iris()
X, y = iris.data[:80], iris.target[:80]

View Class Imbalance

In [4]:
# View value counts
pd.Series(y).value_counts()
Out[4]:
0    50
1    30
dtype: int64

Retrieve Downsampled Class Index

In [5]:
# Indices of each class's observations
class_0 = np.where(y==0)[0]
class_1 = np.where(y==1)[0]
class_1[:5], class_0[:5]
Out[5]:
(array([50, 51, 52, 53, 54]), array([0, 1, 2, 3, 4]))
In [6]:
# Randomly select indices from the majority class to match
# the size of the minority class
class_0_downsampled = np.random.choice(class_0, 
                                       size=len(class_1), 
                                       replace=False)
class_0_downsampled
Out[6]:
array([ 7, 38, 34, 15,  2, 18,  1, 49, 28,  5, 13, 43, 33, 35, 31, 45, 12,
       23,  9,  0, 29, 41, 24,  6, 14, 27, 44, 10, 11, 26])
In [7]:
# Stack the downsampled majority class indices and the minority class indices
downsampled_index = np.hstack((class_0_downsampled, class_1))
downsampled_index
Out[7]:
array([ 7, 38, 34, 15,  2, 18,  1, 49, 28,  5, 13, 43, 33, 35, 31, 45, 12,
       23,  9,  0, 29, 41, 24,  6, 14, 27, 44, 10, 11, 26, 50, 51, 52, 53,
       54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70,
       71, 72, 73, 74, 75, 76, 77, 78, 79])
In [8]:
# Sort the index
downsampled_index.sort()
downsampled_index
Out[8]:
array([ 0,  1,  2,  5,  6,  7,  9, 10, 11, 12, 13, 14, 15, 18, 23, 24, 26,
       27, 28, 29, 31, 33, 34, 35, 38, 41, 43, 44, 45, 49, 50, 51, 52, 53,
       54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70,
       71, 72, 73, 74, 75, 76, 77, 78, 79])
In [9]:
# Downsample the dataset
X, y = X[downsampled_index], y[downsampled_index]

# View new dataset sizes
print(f'Number of rows in X: {len(X)}')
print(f'Number of rows in y: {len(y)}')
Number of rows in X: 60
Number of rows in y: 60
In [10]:
# View the class counts of y
pd.Series(y).value_counts()
Out[10]:
1    30
0    30
dtype: int64
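Because the selection is random, every run keeps a different subset of the majority class. The sketch below condenses the steps above and makes the draw repeatable with a seeded generator. It uses synthetic arrays with the same 50/30 split instead of the iris data, and the names `X_demo`, `y_demo`, `rng`, and `keep`, as well as the seed value, are my own choices for illustration.

```python
import numpy as np

# Synthetic stand-in for the data above: 50 rows of class 0, 30 rows of class 1
y_demo = np.array([0] * 50 + [1] * 30)
X_demo = np.arange(len(y_demo) * 2).reshape(len(y_demo), 2)

# A seeded Generator makes the random draw repeatable (the seed value is arbitrary)
rng = np.random.default_rng(seed=0)

# Indices of each class's observations
class_0 = np.where(y_demo == 0)[0]
class_1 = np.where(y_demo == 1)[0]

# Draw as many majority class indices as there are minority class rows
class_0_down = rng.choice(class_0, size=len(class_1), replace=False)

# Combine and sort, then use the same index on both arrays so rows stay aligned
keep = np.sort(np.hstack((class_0_down, class_1)))
X_bal, y_bal = X_demo[keep], y_demo[keep]

# Count the observations left in each class
print(np.bincount(y_bal))
```

Running the block again with the same seed selects the same rows, which helps when results need to be compared across runs.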
In [1]:
def binary_random_downsample(X, y):
    '''
    Randomly downsample the majority class of a binary classification
    dataset so that both classes have the same number of observations.
    
    Parameters
    ----------
    X: numpy array containing all the features of the dataset
    y: numpy 1D array containing the target values (0 or 1) of the dataset
    
    Returns
    -------
    X, y: the downsampled features and targets, in their original row order
    
    Examples
    --------
    >>> X, y = binary_random_downsample(X, y)
    '''
    
    assert isinstance(X, np.ndarray), 'Make sure X is a numpy array'
    assert isinstance(y, np.ndarray), 'Make sure y is a numpy array'
    
    # Indices of each class's observations
    class_0 = np.where(y == 0)[0]
    class_1 = np.where(y == 1)[0]
    
    # Identify the majority and minority classes
    if len(class_0) > len(class_1):
        majority, minority = class_0, class_1
    else:
        majority, minority = class_1, class_0
    
    # Randomly select majority class indices to match the minority class size
    majority_downsampled = np.random.choice(majority,
                                            size=len(minority),
                                            replace=False)
    
    # Stack, sort, and apply the index to both arrays
    downsampled_index = np.hstack((majority_downsampled, minority))
    downsampled_index.sort()
    X, y = X[downsampled_index], y[downsampled_index]
    
    return X, y

Testing our Function

In [12]:
# Import data and run our function, print the results
iris = load_iris()
X, y = iris.data[:80], iris.target[:80]
X, y = binary_random_downsample(X,y)

print(f'Number of rows in X: {len(X)}')
print(f'Number of rows in y: {len(y)}')
Number of rows in X: 60
Number of rows in y: 60
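The lengths alone do not show that the two classes are balanced. A minimal sketch of an extra check is below; `class_counts` is a hypothetical helper of mine, and `y_example` is a stand-in array for the `y` returned by the function.

```python
import numpy as np

def class_counts(y):
    # Map each label to the number of times it appears in y
    labels, counts = np.unique(y, return_counts=True)
    return dict(zip(labels.tolist(), counts.tolist()))

# Stand-in for the y returned by binary_random_downsample
y_example = np.array([0] * 30 + [1] * 30)

counts = class_counts(y_example)

# The data is balanced when every class has the same count
is_balanced = len(set(counts.values())) == 1
print(counts, is_balanced)
```

Unlike the length printout, this check would also catch a case where the total is right but the split between classes is not.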

Author: Kavi Sekhon