02 Aug

## Downsampling

In supervised learning, many datasets are class imbalanced: one class has far more observations than the other. One way to address this is to downsample the majority class so that it has the same number of observations as the minority class. Below I walk through the steps on NumPy arrays and then wrap them in a function.

### Import Preliminaries

In [2]:
# Import modules
import numpy as np
import pandas as pd


### Import Data

In [3]:
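# Import data (first 80 iris rows: 50 of class 0, 30 of class 1)
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data[:80], iris.target[:80]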


### View Class Imbalance

In [4]:
# View value counts
pd.Series(y).value_counts()

Out[4]:
0    50
1    30
dtype: int64
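
The same counts can be read without pandas. The sketch below uses `np.unique` with `return_counts=True` on a stand-in label array built to mirror the 50/30 split above (the synthetic `y` is an assumption so the snippet runs on its own), and derives the size the majority class will be cut down to.

```python
import numpy as np

# Stand-in labels mirroring the split above: 50 of class 0, 30 of class 1
y = np.array([0] * 50 + [1] * 30)

# Count the observations in each class
labels, counts = np.unique(y, return_counts=True)
class_counts = {int(label): int(count) for label, count in zip(labels, counts)}

# The minority class size is the target size for each class after downsampling
target_size = min(class_counts.values())
print(class_counts, target_size)
```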

### Retrieve Downsampled Class Index

In [5]:
# Indices of each class's observations
class_0 = np.where(y==0)[0]
class_1 = np.where(y==1)[0]
class_1[:5], class_0[:5]

Out[5]:
(array([50, 51, 52, 53, 54]), array([0, 1, 2, 3, 4]))
In [6]:
# Randomly select indices from the majority class to match
# the size of the minority class
class_0_downsampled = np.random.choice(class_0,
                                       size=len(class_1),
                                       replace=False)
class_0_downsampled

Out[6]:
array([ 7, 38, 34, 15,  2, 18,  1, 49, 28,  5, 13, 43, 33, 35, 31, 45, 12,
23,  9,  0, 29, 41, 24,  6, 14, 27, 44, 10, 11, 26])
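
Because `np.random.choice` draws from NumPy's global random state, the indices in Out[6] will differ from run to run. If a repeatable sample is wanted, one option is a seeded `Generator`. This is only a sketch: the seed value `42` and the stand-in `class_0` / `class_1` index arrays are assumptions chosen for illustration.

```python
import numpy as np

# Stand-in index arrays matching the split above (assumed for illustration)
class_0 = np.arange(0, 50)    # majority class row indices
class_1 = np.arange(50, 80)   # minority class row indices

# A seeded generator returns the same sample on every run
rng = np.random.default_rng(42)
class_0_downsampled = rng.choice(class_0, size=len(class_1), replace=False)
print(class_0_downsampled)
```

Passing `replace=False` keeps every selected index unique, so no majority row is drawn twice.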
In [7]:
# Stack the downsampled majority class index and the minority class index
downsampled_index = np.hstack((class_0_downsampled, class_1))
downsampled_index

Out[7]:
array([ 7, 38, 34, 15,  2, 18,  1, 49, 28,  5, 13, 43, 33, 35, 31, 45, 12,
23,  9,  0, 29, 41, 24,  6, 14, 27, 44, 10, 11, 26, 50, 51, 52, 53,
54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70,
71, 72, 73, 74, 75, 76, 77, 78, 79])
In [8]:
# Sort the index
downsampled_index.sort()
downsampled_index

Out[8]:
array([ 0,  1,  2,  5,  6,  7,  9, 10, 11, 12, 13, 14, 15, 18, 23, 24, 26,
27, 28, 29, 31, 33, 34, 35, 38, 41, 43, 44, 45, 49, 50, 51, 52, 53,
54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70,
71, 72, 73, 74, 75, 76, 77, 78, 79])
In [9]:
# Downsample the dataset
X, y = X[downsampled_index], y[downsampled_index]

# View new dataset sizes
print(f'DataFrame Shape of X: {len(X)}')
print(f'DataFrame Shape of y: {len(y)}')

DataFrame Shape of X: 60
DataFrame Shape of y: 60

In [10]:
# View the class counts of y
pd.Series(y).value_counts()

Out[10]:
1    30
0    30
dtype: int64
In [1]:
def binary_random_downsample(X, y):
    '''
    Randomly downsample the majority class of a binary classification
    dataset so that both classes have the same number of observations.

    Parameters
    ----------
    X: numpy array containing all the features of the dataset
    y: numpy 1D array containing all target values (0 or 1) of the dataset

    Returns
    -------
    X, y: the downsampled feature array and target array

    Examples
    --------
    >>> X, y = binary_random_downsample(X, y)
    '''

    assert isinstance(X, np.ndarray), 'Make sure X is a numpy array'
    assert isinstance(y, np.ndarray), 'Make sure y is a numpy array'

    # Indices of each class's observations
    class_0 = np.where(y == 0)[0]
    class_1 = np.where(y == 1)[0]

    # Identify the majority and minority classes
    if len(class_0) > len(class_1):
        majc = class_0
        minc = class_1
    else:
        majc = class_1
        minc = class_0

    # Randomly select majority class indices to match the minority class size
    maj_class_downsampled = np.random.choice(majc,
                                             size=len(minc),
                                             replace=False)

    # Combine and sort the indices, then subset the data
    downsampled_index = np.hstack((maj_class_downsampled, minc))
    downsampled_index.sort()
    X, y = X[downsampled_index], y[downsampled_index]

    return X, y



### Testing our Function

In [12]:
# Import data and run our function, print the results

DataFrame Shape of X: 60
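
The code for this test cell is not shown above. A minimal sketch of such a check might look like the following. It assumes a synthetic stand-in for the 80-row iris slice (random features with a 50/30 label split) so that it runs without scikit-learn, and it repeats a condensed version of the function so the snippet is self-contained; the names `X_down` and `y_down` are introduced here for illustration only.

```python
import numpy as np

def binary_random_downsample(X, y):
    """Condensed restatement of the function above (labels assumed to be 0/1)."""
    class_0 = np.where(y == 0)[0]
    class_1 = np.where(y == 1)[0]
    if len(class_0) > len(class_1):
        majority, minority = class_0, class_1
    else:
        majority, minority = class_1, class_0
    # Sample as many majority indices as there are minority observations
    sampled = np.random.choice(majority, size=len(minority), replace=False)
    index = np.sort(np.hstack((sampled, minority)))
    return X[index], y[index]

# Synthetic stand-in data: 80 rows, 4 features, 50 of class 0 and 30 of class 1
X = np.random.rand(80, 4)
y = np.array([0] * 50 + [1] * 30)

X_down, y_down = binary_random_downsample(X, y)

# Both arrays should have 60 rows, with 30 observations per class
print(f'Rows in X: {len(X_down)}')
print(f'Rows in y: {len(y_down)}')
print(np.bincount(y_down))
```

Whichever rows are drawn, the row counts and the per-class counts should come out the same, so the check does not depend on the random seed.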