Dealing with Missing Data
Why do we need to convert missing records to NaN (Not a Number)?
A newly collected data set is usually messy: it may contain missing values (empty records) or numeric entries marked with placeholder symbols (such as '-' or 'M') indicating null values. The Pandas package is handy here because it can automatically convert missing values to NaN ("Not a Number"). Users can then keep working with the other features without having to discard a data entry just because one of its features is missing.
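As a small illustration of this automatic conversion (the toy column names and values here are made up, not from any real data set), pandas turns empty CSV cells into NaN on import, and numeric operations then skip those entries:

```python
import io
import pandas as pd

# A tiny CSV with two empty cells; pandas reads them as NaN automatically.
csv_data = "Tmax,Tmin\n83,50\n,47\n78,\n"
df = pd.read_csv(io.StringIO(csv_data))

print(df.isnull().sum())   # one NaN in each column
print(df["Tmax"].mean())   # NaN entries are skipped: (83 + 78) / 2 = 80.5
```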
For example, let us refer to the weather.csv data available for the Kaggle competition "Predict West Nile virus in mosquitos across the city of Chicago". The weather data describes weather conditions that may affect the presence of West Nile virus. If we inspect this data set with the summarize() method of a DataIn instance, only 6 features are marked as numeric, as shown below.
# weather.csv file downloaded from https://www.kaggle.com/c/predict-west-nile-virus/data
%matplotlib inline
import sys
sys.path.append("../bin/")
from data import DataIn
import pandas as pd
from preprocess import summary
if __name__ == "__main__":
    weather = DataIn("weather.csv")
    weather.summarize()
If we inspect the raw data more closely, we notice that the feature "Depart" has many missing entries marked as M. In addition, the features "Sunrise" and "Sunset" have missing entries marked as -. By default, when we import the .csv file into a Pandas DataFrame, a feature that mixes text (the 'M' markers) with numeric data is treated as the object data type and processed as text.
Pandas's to_numeric(errors='coerce') function is handy here: it converts the entries containing numbers to a numeric data type and marks the non-numeric entries of the same feature as NaN. Once these entries are NaN, we can still perform mathematical operations on the features, since NaN records are ignored. We can also use the dropna() method of DataFrame to remove any row in which a null value (NaN) is present. For this purpose, I designed a helper function, numeric(), to achieve this.
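A minimal sketch of this coercion behavior on a toy column (the values are illustrative, not taken from weather.csv):

```python
import pandas as pd

# A column mixing numbers with the 'M' marker is stored as object (text).
s = pd.Series(["3.1", "M", "-2.4", "M"])
print(s.dtype)  # object

# errors='coerce' converts the numbers and turns everything else into NaN.
converted = pd.to_numeric(s, errors="coerce")
print(converted.mean())           # NaN entries ignored: about 0.35
print(converted.dropna().size)    # dropna() removes the two NaN rows
```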
def numeric(self, clean=True):
    """Convert the features with partial numeric records to numeric type.

    Parameters
    ----------
    clean : boolean, default True
        If True, remove all rows in which a NaN is found.
        If False, keep the rows with NaN values.
    """
    colnames = self.df.columns
    temp = self.df.apply(pd.to_numeric, errors='coerce')
    removed = []
    for x in colnames:
        # Columns that become all-NaN are genuine object (text) columns
        if temp[x].isnull().values.all():
            removed.append(x)
    colnames = colnames.drop(removed)
    # Replace the text values with the coerced numeric values
    self.df[colnames] = temp[colnames]
    # Optionally remove all rows in which a NaN is present
    if clean:
        self.df = self.df.dropna()
The main idea of this helper function is to use pd.to_numeric(errors='coerce') for the conversion. After the conversion, any feature whose values are all NaN must have been a genuine object (text) feature, so numeric conversion is not appropriate for it. The numeric() function therefore only converts features that have at least part of their data in numeric format.

The following demonstrates the usage of numeric() on weather (the DataIn instance); we notice that the number of data entries is reduced from 2944 to 1294. The features "Date", "CodeSum", and "Water1" are correctly retained as the object type. There is a risk of discarding data entries that have only a few NaN values, which is not optimal for analysis. For instance, in the following chart, the "Station" values of the remaining records all belong to "Station=1", while the complete data set has both "Station=1" and "Station=2". In fact, the "Station" feature is better treated as a categorical feature. Nonetheless, we can now see what proportion of our data has complete records without missing data, as well as the distribution of the categorical features with their labels encoded.

If we want to apply the Imputer in scikit-learn to fill in the missing values (NaN), using strategies such as replacing them with the average values, we can simply call numeric(clean=False) instead to retain all the records. The example code of this post is available on my GitHub page as well.
# Convert the data to numeric type and remove NaN entries
weather.numeric()
weather.summarize()
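If we keep the NaN records with numeric(clean=False), a mean-imputation step might look like the following sketch. Note that recent scikit-learn versions replaced the older Imputer class with SimpleImputer; the column names and values below are made up for illustration, not taken from weather.csv:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for the coerced weather data; 'Depart' has a NaN
# where the raw file contained an 'M' marker.
df = pd.DataFrame({"Tmax": [83.0, 59.0, 78.0],
                   "Depart": [14.0, np.nan, -2.0]})

# Replace each NaN with the mean of its column.
imputer = SimpleImputer(strategy="mean")
df[["Tmax", "Depart"]] = imputer.fit_transform(df[["Tmax", "Depart"]])

print(df["Depart"].tolist())  # the NaN becomes the column mean, 6.0
```

With this strategy, every row is retained, and "Station=2" records would no longer be lost.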