Sparse Matrix
A sparse matrix is an alternative way to hold data that contains many zeros, such as a DataFrame of one-hot encoded features. Instead of storing every "0", a sparse matrix stores only the non-zero values together with their positions. When most entries are zero, this takes far less memory than the dense DataFrame, which lets us process larger datasets in the same amount of RAM.
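To make the storage idea concrete, here is a minimal sketch of what SciPy's Compressed Sparse Row (CSR) format keeps internally. The small `dense` array is a made-up example, not data from this notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 4x3 array with only four non-zero entries
dense = np.array([
    [0, 0, 3],
    [4, 0, 0],
    [0, 0, 0],
    [0, 5, 6],
])
sparse = csr_matrix(dense)

print(sparse.data)     # the non-zero values, read row by row
print(sparse.indices)  # the column index of each stored value
print(sparse.indptr)   # offsets marking where each row starts in data/indices
print(sparse.nnz)      # number of stored entries
```

Only the four non-zero values are stored, along with their column indices and one offset per row boundary. The zeros are implied by their absence.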
Import Preliminaries
%matplotlib inline
%config InlineBackend.figure_format='retina'
# Import modules
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import matplotlib as mpl
import numpy as np
import pandas as pd
import sklearn
import seaborn as sns
import warnings
from scipy.sparse import csr_matrix
# Import Model Selection
from sklearn.model_selection import train_test_split, cross_val_score
# Set pandas options
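# Use the full option names; the short forms are ambiguous in recent pandas versions
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 30)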
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# Set plotting options
mpl.rcParams['figure.figsize'] = (8.0, 7.0)
Create a DataFrame
df = pd.DataFrame(
    data={
        'name': ['Group A', 'Group B'] * 5000,
        'number': np.random.choice([0, 1], size=10000),
        'value': np.random.choice([0, 1], size=10000),
    }
)
df.head(10)
Encode Data
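# One-hot encode the 'name' column; dtype=int keeps every column numeric
# (pandas 2.0+ otherwise returns boolean dummy columns)
df = pd.get_dummies(df, dtype=int)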
df.head()
DataFrame to Sparse Matrix
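# Convert the DataFrame values to Compressed Sparse Row (CSR) format
sparse_matrix = csr_matrix(df.values)
# Keep the column names separately; the sparse matrix does not store them
feature_names = df.columns
sparse_matrix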
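How much memory the conversion saves depends on how many entries are zero. The sketch below compares the footprint of a dense array with its CSR equivalent. The array, its roughly 1% density, and the helper `csr_nbytes` are assumptions chosen for illustration, not part of the notebook above.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical data: 10,000 x 50 array where at most 5,000 cells are set to 1
rng = np.random.default_rng(0)
dense = np.zeros((10_000, 50), dtype=np.int64)
rows = rng.integers(0, 10_000, size=5_000)
cols = rng.integers(0, 50, size=5_000)
dense[rows, cols] = 1

sparse = csr_matrix(dense)

def csr_nbytes(m):
    """Bytes used by the three arrays that make up a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

dense_bytes = dense.nbytes
sparse_bytes = csr_nbytes(sparse)
print(f"dense:  {dense_bytes:,} bytes")
print(f"sparse: {sparse_bytes:,} bytes")
```

CSR stores an index alongside each value, so it only pays off when the data is mostly zeros. For columns that are about half zeros, the saving may be small or even negative.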
Sparse Matrix to DataFrame
# toarray() returns a plain ndarray, whereas todense() returns the legacy np.matrix type
df = pd.DataFrame(sparse_matrix.toarray(), columns=feature_names)
df.head(10)
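To check that the round trip loses no information, the dense values can be compared before and after the conversion. The sketch below does this on a small made-up frame. It also shows `pd.DataFrame.sparse.from_spmatrix`, an alternative to the approach above that keeps the columns sparse inside pandas rather than densifying them.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical frame used only for this check
original = pd.DataFrame({'a': [0, 1, 0, 0], 'b': [0, 0, 0, 2]})

# DataFrame -> CSR -> DataFrame
sparse_matrix = csr_matrix(original.to_numpy())
restored = pd.DataFrame(sparse_matrix.toarray(), columns=original.columns)
round_trip_ok = np.array_equal(restored.to_numpy(), original.to_numpy())
print(round_trip_ok)

# Alternative: keep the data sparse inside pandas
sparse_df = pd.DataFrame.sparse.from_spmatrix(sparse_matrix, columns=original.columns)
density = sparse_df.sparse.density  # fraction of entries that are actually stored
print(density)
```

The sparse-backed frame is useful when the dense version would not fit in memory, though some pandas operations may convert it back to dense behind the scenes.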
Author: Kavi Sekhon