05 Sep

Dropping Features

Dropping features is a common task when cleaning data. I use Pandas to drop most of the features and observations in my workflow. Below is a collection of drop statements in Python. You can find the full documentation on dropping values in the Pandas documentation (link below).

Import Preliminaries

In [1]:
# Import modules
import pandas as pd

Import Data

In [62]:
# Import data
df = pd.read_csv('Data/Pokemon.csv')

# View the head of the dataframe 
df.head()
Out[62]:
   #                   Name  Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
0  1              Bulbasaur   Grass  Poison    318  45      49       49       65       65     45           1      False
1  2                Ivysaur   Grass  Poison    405  60      62       63       80       80     60           1      False
2  3               Venusaur   Grass  Poison    525  80      82       83      100      100     80           1      False
3  3  VenusaurMega Venusaur   Grass  Poison    625  80     100      123      122      120     80           1      False
4  4             Charmander    Fire     NaN    309  39      52       43       60       50     65           1      False

Dropping a Feature

In [63]:
# Drop the "Name" feature and view the head of the dataframe
df.drop('Name', axis=1).head()
Out[63]:
   #  Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
0  1   Grass  Poison    318  45      49       49       65       65     45           1      False
1  2   Grass  Poison    405  60      62       63       80       80     60           1      False
2  3   Grass  Poison    525  80      82       83      100      100     80           1      False
3  3   Grass  Poison    625  80     100      123      122      120     80           1      False
4  4    Fire     NaN    309  39      52       43       60       50     65           1      False

Dropping Observations

In [64]:
# Drop the observation with index label 3 (the fourth row) from the DataFrame
df.drop(3, axis=0).head()
Out[64]:
   #        Name  Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
0  1   Bulbasaur   Grass  Poison    318  45      49       49       65       65     45           1      False
1  2     Ivysaur   Grass  Poison    405  60      62       63       80       80     60           1      False
2  3    Venusaur   Grass  Poison    525  80      82       83      100      100     80           1      False
4  4  Charmander    Fire     NaN    309  39      52       43       60       50     65           1      False
5  5  Charmeleon    Fire     NaN    405  58      64       58       80       65     80           1      False
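
One thing worth spelling out is that drop works on index labels, not row positions. The sketch below uses a small made-up DataFrame (the name toy and its values are my own stand-ins, not the Pokemon file) with non-default labels to show the difference, and how to drop several observations at once.

```python
import pandas as pd

# Hypothetical stand-in data with non-default index labels (not the real CSV)
toy = pd.DataFrame(
    {"Name": ["Bulbasaur", "Ivysaur", "Venusaur", "Charmander", "Charmeleon"],
     "HP": [45, 60, 80, 39, 58]},
    index=[10, 11, 12, 13, 14],
)

# Drop by position: look up the label stored at position 3, then drop it
by_position = toy.drop(toy.index[3], axis=0)

# Drop several observations at once by passing a list of labels
several = toy.drop([10, 14], axis=0)

# Passing 3 directly fails here, because no row carries the label 3
label_missing = False
try:
    toy.drop(3, axis=0)
except KeyError:
    label_missing = True
```

With the default 0, 1, 2, ... index the label and the position happen to coincide, which is why df.drop(3, axis=0) removes the fourth row above.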

Filtering Observations

You can drop observations using the drop function in Pandas, but it is usually quicker to just use loc and the boolean filtering available in the package.

In [65]:
# View all the observations where Type 1 is not 'Grass' via filtering
df[df['Type 1'] != 'Grass'].head()
Out[65]:
   #                       Name  Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
4  4                 Charmander    Fire     NaN    309  39      52       43       60       50     65           1      False
5  5                 Charmeleon    Fire     NaN    405  58      64       58       80       65     80           1      False
6  6                  Charizard    Fire  Flying    534  78      84       78      109       85    100           1      False
7  6  CharizardMega Charizard X    Fire  Dragon    634  78     130      111      130       85    100           1      False
8  6  CharizardMega Charizard Y    Fire  Flying    634  78     104       78      159      115    100           1      False
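
To make the comparison with drop concrete, here is a minimal sketch on a made-up DataFrame (toy and its values are assumptions for illustration). It builds the same result twice: once with a boolean mask and once by passing the matching index labels to drop.

```python
import pandas as pd

# Hypothetical stand-in data (not the real Pokemon file)
toy = pd.DataFrame(
    {"Name": ["Bulbasaur", "Oddish", "Charmander", "Squirtle"],
     "Type 1": ["Grass", "Grass", "Fire", "Water"]}
)

# Boolean mask marking the observations we want to get rid of
mask = toy["Type 1"] == "Grass"

# Filtering: keep every row where the mask is False
filtered = toy[~mask]

# Dropping: collect the labels of the matching rows and drop them
dropped = toy.drop(toy[mask].index, axis=0)

# Both routes give the same DataFrame
same = filtered.equals(dropped)
```

The filtering version is one step shorter, which is the main reason I reach for it.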

Loc Statements

In [66]:
# Select only a subset of the dataset using the loc statement
df.loc[:2, 'Name':'HP']
Out[66]:
        Name  Type 1  Type 2  Total  HP
0  Bulbasaur   Grass  Poison    318  45
1    Ivysaur   Grass  Poison    405  60
2   Venusaur   Grass  Poison    525  80
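
Note that the loc slice above returns three rows for :2, because label-based slices include both endpoints. A short sketch on a made-up DataFrame (toy is a stand-in of my own) contrasts this with position-based iloc, where the end of the slice is excluded.

```python
import pandas as pd

# Hypothetical stand-in data (not the real Pokemon file)
toy = pd.DataFrame(
    {"Name": ["Bulbasaur", "Ivysaur", "Venusaur", "Charmander"],
     "Type 1": ["Grass", "Grass", "Grass", "Fire"],
     "HP": [45, 60, 80, 39]}
)

# Label-based: rows 0 through 2 and columns 'Name' through 'Type 1', inclusive
by_label = toy.loc[:2, "Name":"Type 1"]

# Position-based: rows 0 and 1 only, columns at positions 0 and 1
by_position = toy.iloc[:2, 0:2]
```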

Dropping Multiple Features

In [67]:
# Drop the "#", "Name", "Generation", and "Legendary" features and view the head of the dataframe
df.drop(['#', 'Name', 'Generation', 'Legendary'], axis=1).head()
Out[67]:
   Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed
0   Grass  Poison    318  45      49       49       65       65     45
1   Grass  Poison    405  60      62       63       80       80     60
2   Grass  Poison    525  80      82       83      100      100     80
3   Grass  Poison    625  80     100      123      122      120     80
4    Fire     NaN    309  39      52       43       60       50     65
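
As an alternative spelling, drop also accepts a columns keyword, which saves you from remembering which axis number means what. The sketch below runs on a made-up DataFrame (toy and the column "Not A Column" are assumptions for illustration) and also shows errors="ignore", which skips labels that do not exist instead of raising.

```python
import pandas as pd

# Hypothetical stand-in data (not the real Pokemon file)
toy = pd.DataFrame(
    {"#": [1, 2], "Name": ["Bulbasaur", "Ivysaur"],
     "Type 1": ["Grass", "Grass"], "HP": [45, 60]}
)

# Same effect as toy.drop(['#', 'Name'], axis=1)
trimmed = toy.drop(columns=["#", "Name"])

# Skip labels that are missing rather than raising a KeyError
safe = toy.drop(columns=["Name", "Not A Column"], errors="ignore")
```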

Dropping Columns Inplace

By default, the inplace parameter for all the drop functions is set to False, but you can easily pass in the opposite Boolean value.

In [68]:
# Copy the DataFrame
df_copy = df.copy()

# Drop the Name column
df_copy.drop('Name', axis=1, inplace=True)

# View the head of the dataframe 
df_copy.head()
Out[68]:
   #  Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
0  1   Grass  Poison    318  45      49       49       65       65     45           1      False
1  2   Grass  Poison    405  60      62       63       80       80     60           1      False
2  3   Grass  Poison    525  80      82       83      100      100     80           1      False
3  3   Grass  Poison    625  80     100      123      122      120     80           1      False
4  4    Fire     NaN    309  39      52       43       60       50     65           1      False
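
A small sketch of what the two settings actually return may help here. It uses a made-up DataFrame (toy is my own stand-in): with the default, drop hands back a new DataFrame and leaves the original alone, while with inplace=True it modifies the object and returns None.

```python
import pandas as pd

# Hypothetical stand-in data (not the real Pokemon file)
toy = pd.DataFrame({"Name": ["Bulbasaur", "Ivysaur"], "HP": [45, 60]})

# Default (inplace=False): a new DataFrame comes back, toy is unchanged
returned = toy.drop("Name", axis=1)

# inplace=True: the copy is modified directly and the call returns None
toy_copy = toy.copy()
result = toy_copy.drop("Name", axis=1, inplace=True)
```

Because the in-place call returns None, chaining something like .head() onto it will fail, so reassigning the returned DataFrame is often the safer habit.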

Dropping Index

If you want to reset your index, the best way to do so is to use the reset_index function. But if you have a DataFrame with a MultiIndex, you can also use drop to remove the data points related to an index value.

In [75]:
# Filtered dataframe for Grass Pokemon
df[(df['Type 1'] == 'Grass') | (df['Type 2'] == 'Grass')].head()
Out[75]:
     #                   Name  Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
 0   1              Bulbasaur   Grass  Poison    318  45      49       49       65       65     45           1      False
 1   2                Ivysaur   Grass  Poison    405  60      62       63       80       80     60           1      False
 2   3               Venusaur   Grass  Poison    525  80      82       83      100      100     80           1      False
 3   3  VenusaurMega Venusaur   Grass  Poison    625  80     100      123      122      120     80           1      False
48  43                 Oddish   Grass  Poison    320  45      50       55       75       65     30           1      False
In [76]:
# Filtered dataframe for Grass Pokemon + reset index
df[(df['Type 1'] == 'Grass') | (df['Type 2'] == 'Grass')].reset_index().head()
Out[76]:
   index   #                   Name  Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
0      0   1              Bulbasaur   Grass  Poison    318  45      49       49       65       65     45           1      False
1      1   2                Ivysaur   Grass  Poison    405  60      62       63       80       80     60           1      False
2      2   3               Venusaur   Grass  Poison    525  80      82       83      100      100     80           1      False
3      3   3  VenusaurMega Venusaur   Grass  Poison    625  80     100      123      122      120     80           1      False
4     48  43                 Oddish   Grass  Poison    320  45      50       55       75       65     30           1      False
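
Notice that reset_index keeps the old labels around as a new column called index. If you would rather throw them away, reset_index takes a drop argument. Here is a minimal sketch on a made-up DataFrame (toy is a stand-in of my own, not the Pokemon file).

```python
import pandas as pd

# Hypothetical stand-in data (not the real Pokemon file)
toy = pd.DataFrame(
    {"Name": ["Bulbasaur", "Charmander", "Oddish"],
     "Type 1": ["Grass", "Fire", "Grass"]}
)

# Filtering leaves gaps in the index (labels 0 and 2 here)
grass = toy[toy["Type 1"] == "Grass"]

# Default: the old labels are kept as a new 'index' column
kept = grass.reset_index()

# drop=True: the old labels are discarded instead
discarded = grass.reset_index(drop=True)
```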
In [91]:
# Create a MultiIndex DataFrame via the groupby function
# (select the columns with a list; tuple-style selection fails in recent Pandas)
sum_stats = df.groupby(['Type 1', 'Type 2'])[['Total', 'HP', 'Attack', 'Defense',
                                              'Sp. Atk', 'Sp. Def', 'Speed']].mean()
sum_stats.head(19)
Out[91]:
                        Total          HP      Attack     Defense     Sp. Atk     Sp. Def       Speed
Type 1   Type 2
Bug      Electric  395.500000   60.000000   62.000000   55.000000   77.000000   55.000000   86.500000
         Fighting  550.000000   80.000000  155.000000   95.000000   40.000000  100.000000   80.000000
         Fire      455.000000   70.000000   72.500000   60.000000   92.500000   80.000000   80.000000
         Flying    419.500000   63.000000   70.142857   61.571429   72.857143   69.071429   82.857143
         Ghost     236.000000    1.000000   90.000000   45.000000   30.000000   30.000000   40.000000
         Grass     384.000000   55.000000   73.833333   76.666667   57.333333   76.666667   44.500000
         Ground    345.000000   45.500000   62.000000   97.500000   44.500000   57.500000   38.000000
         Poison    347.916667   53.750000   68.333333   58.083333   42.500000   59.333333   65.916667
         Rock      435.000000   46.666667   56.666667  146.666667   36.666667  113.333333   35.000000
         Steel     509.714286   67.714286  114.714286  112.428571   68.142857   83.285714   63.428571
         Water     269.000000   40.000000   30.000000   32.000000   50.000000   52.000000   65.000000
Dark     Dragon    440.000000   72.000000   85.000000   70.000000   78.333333   70.000000   64.666667
         Fighting  418.000000   57.500000   82.500000   92.500000   40.000000   92.500000   53.000000
         Fire      476.666667   65.000000   80.000000   56.666667  110.000000   73.333333   91.666667
         Flying    494.000000   93.200000   92.200000   73.800000   84.200000   70.400000   80.200000
         Ghost     430.000000   50.000000   80.000000  100.000000   75.000000   90.000000   35.000000
         Ice       470.000000   62.500000  107.500000   60.000000   40.000000   80.000000  120.000000
         Psychic   385.000000   69.500000   73.000000   70.500000   52.500000   60.500000   59.000000
         Steel     415.000000   55.000000  105.000000   85.000000   50.000000   55.000000   65.000000
In [105]:
# Drop the "Bug" type from the first level of the index
sum_stats.drop('Bug', axis=0, level=0).head(19)
Out[105]:
                        Total          HP      Attack     Defense     Sp. Atk     Sp. Def       Speed
Type 1   Type 2
Dark     Dragon    440.000000   72.000000   85.000000   70.000000   78.333333   70.000000   64.666667
         Fighting  418.000000   57.500000   82.500000   92.500000   40.000000   92.500000   53.000000
         Fire      476.666667   65.000000   80.000000   56.666667  110.000000   73.333333   91.666667
         Flying    494.000000   93.200000   92.200000   73.800000   84.200000   70.400000   80.200000
         Ghost     430.000000   50.000000   80.000000  100.000000   75.000000   90.000000   35.000000
         Ice       470.000000   62.500000  107.500000   60.000000   40.000000   80.000000  120.000000
         Psychic   385.000000   69.500000   73.000000   70.500000   52.500000   60.500000   59.000000
         Steel     415.000000   55.000000  105.000000   85.000000   50.000000   55.000000   65.000000
Dragon   Electric  680.000000  100.000000  150.000000  120.000000  120.000000  100.000000   90.000000
         Fairy     590.000000   75.000000  110.000000  110.000000  110.000000  105.000000   80.000000
         Fire      680.000000  100.000000  120.000000  100.000000  150.000000  120.000000   90.000000
         Flying    641.666667   94.333333  135.666667   97.500000  121.666667   94.166667   98.333333
         Ground    522.000000   90.000000  112.000000   88.200000   74.200000   75.000000   82.600000
         Ice       686.666667  125.000000  140.000000   93.333333  140.000000   93.333333   95.000000
         Psychic   650.000000   80.000000  100.000000   97.500000  135.000000  127.500000  110.000000
Electric Dragon    610.000000   90.000000   95.000000  105.000000  165.000000  110.000000   45.000000
         Fairy     431.000000   67.000000   58.000000   57.000000   81.000000   67.000000  101.000000
         Fire      520.000000   50.000000   65.000000  107.000000  105.000000  107.000000   86.000000
         Flying    537.600000   70.600000   90.000000   78.400000  115.000000   83.400000  100.200000
In [107]:
# Drop the "Fire" type from the second level of the index
sum_stats.drop('Fire', axis=0, level=1).head(19)
Out[107]:
                        Total          HP      Attack     Defense     Sp. Atk     Sp. Def       Speed
Type 1   Type 2
Bug      Electric  395.500000   60.000000   62.000000   55.000000   77.000000   55.000000   86.500000
         Fighting  550.000000   80.000000  155.000000   95.000000   40.000000  100.000000   80.000000
         Flying    419.500000   63.000000   70.142857   61.571429   72.857143   69.071429   82.857143
         Ghost     236.000000    1.000000   90.000000   45.000000   30.000000   30.000000   40.000000
         Grass     384.000000   55.000000   73.833333   76.666667   57.333333   76.666667   44.500000
         Ground    345.000000   45.500000   62.000000   97.500000   44.500000   57.500000   38.000000
         Poison    347.916667   53.750000   68.333333   58.083333   42.500000   59.333333   65.916667
         Rock      435.000000   46.666667   56.666667  146.666667   36.666667  113.333333   35.000000
         Steel     509.714286   67.714286  114.714286  112.428571   68.142857   83.285714   63.428571
         Water     269.000000   40.000000   30.000000   32.000000   50.000000   52.000000   65.000000
Dark     Dragon    440.000000   72.000000   85.000000   70.000000   78.333333   70.000000   64.666667
         Fighting  418.000000   57.500000   82.500000   92.500000   40.000000   92.500000   53.000000
         Flying    494.000000   93.200000   92.200000   73.800000   84.200000   70.400000   80.200000
         Ghost     430.000000   50.000000   80.000000  100.000000   75.000000   90.000000   35.000000
         Ice       470.000000   62.500000  107.500000   60.000000   40.000000   80.000000  120.000000
         Psychic   385.000000   69.500000   73.000000   70.500000   52.500000   60.500000   59.000000
         Steel     415.000000   55.000000  105.000000   85.000000   50.000000   55.000000   65.000000
Dragon   Electric  680.000000  100.000000  150.000000  120.000000  120.000000  100.000000   90.000000
         Fairy     590.000000   75.000000  110.000000  110.000000  110.000000  105.000000   80.000000
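
Since the real outputs are long, here is the same idea as a compact sketch on a made-up DataFrame (toy and its numbers are assumptions for illustration, not values from the Pokemon file). It repeats the two level-based drops and adds a third case: removing one specific combination by passing a tuple.

```python
import pandas as pd

# Hypothetical stand-in data (not the real Pokemon file)
toy = pd.DataFrame(
    {"Type 1": ["Bug", "Bug", "Bug", "Dark", "Dark"],
     "Type 2": ["Fire", "Flying", "Flying", "Fire", "Ghost"],
     "HP": [70, 60, 66, 65, 50],
     "Attack": [72, 70, 71, 80, 80]}
)

# Grouping on two columns produces a two-level MultiIndex
stats = toy.groupby(["Type 1", "Type 2"])[["HP", "Attack"]].mean()

# Drop every row whose first-level label is 'Bug'
no_bug = stats.drop("Bug", axis=0, level=0)

# Drop every row whose second-level label is 'Fire'
no_fire = stats.drop("Fire", axis=0, level=1)

# Drop one specific (Type 1, Type 2) combination
one_pair = stats.drop(index=("Dark", "Ghost"))
```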

Author: Kavi Sekhon