01 Aug

Null Values

There are multiple ways to handle missing data, and some people come up with very creative solutions. This notebook covers some basic methods for handling missing data. As always, your strategy will vary with the context of the problem and your domain expertise.

**Basic Strategies**

  1. Removing observations
  2. Filling in NaN values with a certain value
  3. Filling in NaN values with the mean
  4. Filling in NaN values with the median
  5. Dropping columns (features) with missing values

The best strategy is normally context-specific, so the more contextual knowledge you have, the better.
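
Before choosing among these strategies, it can help to measure how much data is actually missing in each column. The following is a minimal sketch on a small hypothetical DataFrame named `df` (the names and values are made up for illustration and are not the notebook's data):

```python
import numpy as np
import pandas as pd

# Hypothetical example data, for illustration only
df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Midterm_Score': [88, 75, 92, 81],
    'Final_Score': [95.0, np.nan, np.nan, 91.0],
})

# Number of missing values in each column
missing_counts = df.isna().sum()

# Share of missing values in each column (between 0.0 and 1.0)
missing_share = df.isna().mean()

print(missing_counts)
print(missing_share)
```

A column that is mostly empty may be a candidate for dropping, while a column with only a few gaps is usually a better fit for filling.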

Import Preliminaries

In [18]:
# Import modules
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import missingno as msno

Create Data

In [31]:
# Create some student data
students = pd.DataFrame({'Name' : ['Student_' + str(i) for i in range(100)],
             'Midterm_Score' : np.random.randint(70, 100, size=100),
             'Final_Score' : np.random.randint(90, 100, size=100)
             })

# Create null values in the final scores by replacing scores from 92 to 96 with NaN.
# Assigning the result back avoids relying on inplace=True through attribute access.
students['Final_Score'] = students['Final_Score'].replace(
    to_replace=list(range(92, 97)), value=np.nan)

# View our dataframe
students.head(15)
Out[31]:
          Name  Midterm_Score  Final_Score
0    Student_0             89         99.0
1    Student_1             84         91.0
2    Student_2             87          NaN
3    Student_3             71          NaN
4    Student_4             98         97.0
5    Student_5             86          NaN
6    Student_6             96         99.0
7    Student_7             96         97.0
8    Student_8             78          NaN
9    Student_9             89          NaN
10  Student_10             80         97.0
11  Student_11             94         91.0
12  Student_12             78          NaN
13  Student_13             72          NaN
14  Student_14             90         98.0
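
Because the scores are drawn at random, every run of the cell above produces a different table. If reproducible output matters, one option is a seeded generator. This is a sketch only: the name `students_seeded` and the seed value 42 are arbitrary choices for illustration, and `Series.mask` is used here in place of `replace` to blank out the same 92 to 96 range.

```python
import numpy as np
import pandas as pd

# Seeded generator so the same scores are drawn on every run (seed is arbitrary)
rng = np.random.default_rng(42)

students_seeded = pd.DataFrame({
    'Name': ['Student_' + str(i) for i in range(100)],
    'Midterm_Score': rng.integers(70, 100, size=100),  # upper bound is exclusive
    'Final_Score': rng.integers(90, 100, size=100),
})

# Mark final scores from 92 to 96 (inclusive) and replace them with NaN
to_blank = students_seeded['Final_Score'].between(92, 96)
students_seeded['Final_Score'] = students_seeded['Final_Score'].mask(to_blank)

print(students_seeded['Final_Score'].isna().sum())
```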

Visualization

In [34]:
# Bar plot summarizing null values per feature
msno.bar(students, figsize=(10, 5), fontsize=10);
plt.xlabel('Features')
plt.ylabel('Record Number');
In [33]:
# Plot where null values occur in each feature, record by record
msno.matrix(students, figsize=(10, 5), fontsize=10);
plt.xlabel('Features')
plt.ylabel('Record Number');

Removing Rows with NaN Values

In [30]:
# Drop rows that contain null values
students.dropna().head(15)
Out[30]:
          Name  Midterm_Score  Final_Score
1    Student_1             80         99.0
2    Student_2             94         90.0
4    Student_4             76         98.0
5    Student_5             78         91.0
6    Student_6             70         98.0
7    Student_7             86         90.0
9    Student_9             87         90.0
10  Student_10             89         91.0
11  Student_11             97         98.0
12  Student_12             93         99.0
14  Student_14             90         90.0
16  Student_16             89         98.0
28  Student_28             83         98.0
30  Student_30             86         98.0
31  Student_31             77         98.0
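
Calling `dropna()` with no arguments removes a row if any column is missing. When only some columns matter, the `subset` and `how` parameters of `DataFrame.dropna` give finer control. A minimal sketch on hypothetical data (the DataFrame `df` below is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical example data, for illustration only
df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Midterm_Score': [88.0, np.nan, 92.0, np.nan],
    'Final_Score': [95.0, 97.0, np.nan, np.nan],
})

# Drop a row only if Final_Score is missing; gaps in other columns are tolerated
needs_final = df.dropna(subset=['Final_Score'])

# Drop a row only if both score columns are missing
any_score = df.dropna(subset=['Midterm_Score', 'Final_Score'], how='all')

print(needs_final)
print(any_score)
```

Here `needs_final` keeps students A and B, while `any_score` keeps A, B, and C.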

Filling NaN Values with Another Value

In [7]:
# Fill null values with a constant value (here 0)
students.fillna(0).head(15)
Out[7]:
      Name  Midterm_Score  Final_Score
0   Justin             87          0.0
1      Kim             84         91.0
2  Stephen             72          0.0
3     Paul             94          0.0
4     Jean             72          0.0
5    Brian             87         99.0
6     John             70          0.0
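
A single constant applied to the whole DataFrame fills every column with the same value, and afterwards a filled 0 is indistinguishable from a real score of 0. One possible refinement is to pass a dictionary to `fillna` so each column gets its own value, and to record which entries were missing first. This is a sketch on hypothetical data; the column name `Final_Missing` and the placeholder values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical example data, for illustration only
df = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Midterm_Score': [88.0, np.nan, 92.0],
    'Final_Score': [95.0, 97.0, np.nan],
})

# Record which final scores were missing before they are overwritten
df['Final_Missing'] = df['Final_Score'].isna()

# Fill each column with its own placeholder value
filled = df.fillna({'Midterm_Score': 0, 'Final_Score': -1})

print(filled)
```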

Filling NaN Values with Mean

In [8]:
# Fill null values in Final_Score with that column's mean
students.fillna({'Final_Score': students['Final_Score'].mean()}).head(15)
Out[8]:
      Name  Midterm_Score  Final_Score
0   Justin             87         95.0
1      Kim             84         91.0
2  Stephen             72         95.0
3     Paul             94         95.0
4     Jean             72         95.0
5    Brian             87         99.0
6     John             70         95.0
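
Mean filling keeps the column mean unchanged but reduces its spread, because every filled value sits exactly at the center. A small sketch with a hypothetical Series (values chosen only to make the arithmetic easy to follow):

```python
import numpy as np
import pandas as pd

# Hypothetical final scores with two gaps, for illustration only
scores = pd.Series([90.0, np.nan, 98.0, np.nan, 94.0], name='Final_Score')

# mean() skips NaN by default, so this is the mean of the three observed scores
observed_mean = scores.mean()

# Replace the gaps with the observed mean
imputed = scores.fillna(observed_mean)

# The mean is preserved, while the standard deviation shrinks
print(observed_mean, imputed.mean())
print(scores.std(), imputed.std())
```

If downstream analysis depends on the variance of the column, this shrinkage is worth keeping in mind.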

Filling NaN Values with Median

In [9]:
# Fill null values in Final_Score with that column's median
students.fillna({'Final_Score': students['Final_Score'].median()}).head(15)
Out[9]:
      Name  Midterm_Score  Final_Score
0   Justin             87         95.0
1      Kim             84         91.0
2  Stephen             72         95.0
3     Paul             94         95.0
4     Jean             72         95.0
5    Brian             87         99.0
6     John             70         95.0
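
In the output above the mean and median happen to give the same fill value. They can differ sharply when the column contains outliers, which is the usual reason to prefer the median. A minimal sketch with a hypothetical Series containing one unusually low score:

```python
import numpy as np
import pandas as pd

# Hypothetical final scores with one outlier (10.0) and one gap
scores = pd.Series([95.0, 97.0, np.nan, 96.0, 10.0], name='Final_Score')

# The outlier pulls the mean far below the typical score
mean_value = scores.mean()

# The median stays close to the typical score
median_value = scores.median()

# Fill the gap with the median
filled = scores.fillna(median_value)

print(mean_value, median_value)
print(filled)
```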

Dropping Features with NaN

In [10]:
# Drop features (columns) that contain null values
students.dropna(axis=1).head(15)
Out[10]:
      Name  Midterm_Score
0   Justin             87
1      Kim             84
2  Stephen             72
3     Paul             94
4     Jean             72
5    Brian             87
6     John             70
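
Dropping every column that has any missing value can discard a lot of usable data. The `thresh` parameter of `DataFrame.dropna` keeps a column as long as it has at least a given number of non-null values. The sketch below uses hypothetical data, and the 50% cutoff is an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical example data, for illustration only
df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Midterm_Score': [88.0, np.nan, 92.0, 81.0],     # 1 of 4 missing
    'Final_Score': [np.nan, np.nan, np.nan, 91.0],   # 3 of 4 missing
})

# Minimum number of non-null values a column needs in order to be kept
min_present = int(np.ceil(0.5 * len(df)))

# Keep only columns with at least min_present non-null values
trimmed = df.dropna(axis=1, thresh=min_present)

print(trimmed.columns.tolist())
```

With this cutoff `Midterm_Score` survives despite its single gap, while `Final_Score` is dropped.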

Author: Kavi Sekhon