Dealing with Missing Data

Why do we need to convert missing records to NaN (Not a Number)?

A newly collected data set is often messy: it may contain missing values (empty records), or numeric entries marked with placeholder symbols (such as '-' or 'M') indicating null values. The Pandas package is handy in this case because it can automatically convert missing values to "Not A Number" (NaN). Users can then keep working with the other important features without omitting a data entry just because one of its features has a missing value.
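As a minimal sketch of this behavior, consider the tiny invented CSV below: read_csv() fills the empty cell with NaN automatically, and aggregations such as mean() skip NaN records by default.

```python
import io
import pandas as pd

# Hypothetical two-row CSV with an empty Tavg cell
csv_text = "Date,Tavg\n2007-05-01,67\n2007-05-02,\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df["Tavg"].isna().sum())   # 1: the empty cell became NaN
print(df["Tavg"].mean())         # 67.0: mean ignores the NaN record
```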

For example, let us refer to the weather.csv data available for the Kaggle competition "Predict West Nile virus in mosquitos across the city of Chicago". The weather data describes weather conditions that may affect the presence of West Nile virus. If we inspect this data set with the summarize() method of a DataIn instance, only 6 features are recognized as numeric data, as shown below.

In [1]:
# weather.csv file downloaded from https://www.kaggle.com/c/predict-west-nile-virus/data

%matplotlib inline
import sys
sys.path.append("../bin/")
from data import DataIn
import pandas as pd
from preprocess import summary

if __name__ == "__main__":
    weather = DataIn("weather.csv")
    weather.summarize()                                                              
No missing values in the columns of weather.csv!

--------------------------------------------------------------------------------
********************    Begin of the summary of text data   ********************
--------------------------------------------------------------------------------
count           2944
unique          1472
top       2011-08-18
freq               2
Name: Date, dtype: object


count     2944
unique      60
top         73
freq       138
Name: Tavg, dtype: object


count     2944
unique      42
top          M
freq      1472
Name: Depart, dtype: object


count     2944
unique      48
top         63
freq       135
Name: WetBulb, dtype: object


count     2944
unique      31
top          0
freq      1870
Name: Heat, dtype: object


count     2944
unique      31
top          0
freq      1147
Name: Cool, dtype: object


count     2944
unique     122
top          -
freq      1472
Name: Sunrise, dtype: object


count     2944
unique     119
top          -
freq      1472
Name: Sunset, dtype: object


count     2944
unique      98
top           
freq      1609
Name: CodeSum, dtype: object


count     2944
unique       2
top          0
freq      1472
Name: Depth, dtype: object


count     2944
unique       1
top          M
freq      2944
Name: Water1, dtype: object


count     2944
unique       4
top          M
freq      1472
Name: SnowFall, dtype: object


count     2944
unique     168
top       0.00
freq      1577
Name: PrecipTotal, dtype: object


count      2944
unique      104
top       29.34
freq        128
Name: StnPressure, dtype: object


count      2944
unique      102
top       30.00
freq         96
Name: SeaLevel, dtype: object


count     2944
unique     178
top        6.9
freq        63
Name: AvgSpeed, dtype: object


--------------------------------------------------------------------------------
********************    End of the summary of text data     ********************
--------------------------------------------------------------------------------

On closer inspection of the raw data, we notice that the feature "Depart" has many missing entries marked as M. Likewise, the "Sunrise" and "Sunset" features have missing entries marked as -. By default, when we import the .csv file into a Pandas DataFrame, a feature containing both text (such as the 'M' missing marker) and numbers is treated as the object data type and processed as text data.
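A minimal illustration of this dtype behavior, with column names borrowed from weather.csv and invented data: a single 'M' entry is enough to force the whole column to object, while a fully numeric column stays numeric.

```python
import io
import pandas as pd

# A column mixing numbers with the 'M' missing marker is parsed as object (text)
csv_text = "Depart,Heat\n14,0\nM,5\n-3,0\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df["Depart"].dtype)  # object: the 'M' entry forces text treatment
print(df["Heat"].dtype)    # int64: a fully numeric column stays numeric
```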

Pandas's to_numeric(errors='coerce') function becomes handy in this case: it converts the numeric entries of a column to a numeric data type while marking the non-numeric entries of the same column as NaN. By converting these entries to NaN, we can continue to perform mathematical operations on these features, since Pandas ignores the NaN records. One can then also use the dropna() method of DataFrame to remove any row entry in which a null value (NaN) is present. For this purpose, I designed a helper function, numeric(), to achieve this.

    def numeric(self, clean=True):
        """Convert the features with partial numeric records to numeric type.

        Parameters
        ----------
        clean : boolean, default True
            If True, drop all row entries where a NaN is found.
            If False, keep the row entries with NaN values.
        """
        colnames = self.df.columns
        # Coerce every column: non-numeric entries become NaN
        temp = self.df.apply(pd.to_numeric, errors='coerce')
        removed = []
        for x in colnames:
            # Columns that became all-NaN are genuinely text (object) columns
            if temp[x].isnull().values.all():
                removed.append(x)

        colnames = colnames.drop(removed)
        # Replace the partially numeric columns with their coerced values
        self.df[colnames] = temp[colnames]

        # Remove all row entries in which a NaN is present
        if clean:
            self.df = self.df.dropna()

The main idea of this helper function is to use pd.to_numeric(errors='coerce') for the conversion. If all of a feature's values become NaN after the conversion, the feature is genuinely of object data type and numeric conversion is not suitable; numeric() therefore only converts features whose data are partially numeric. The following demonstrates the usage of numeric() on weather (the DataIn instance): the number of data entries drops from 2944 to 1294, and the features "Date", "CodeSum", and "Water1" are correctly retained as object type.

There is a risk in discarding data entries that have only a few NaN values, which is not optimal for analysis. For instance, in the following chart, the "Station" of the remaining records belongs to "Station=1" only, whereas the complete data contain both "Station=1" and "Station=2". In fact, the "Station" feature is better treated as a categorical feature. Nonetheless, we can now see what proportion of our data has complete records without missing values, as well as the distribution of the categorical features with their labels encoded.

If we instead want to apply the Imputer function in sklearn to complete the missing values (NaN) using strategies such as replacing them with the average value, we can simply call numeric(clean=False) to retain all the records. The example code of this post is available on my GitHub page as well.

In [2]:
# Convert the data to numeric type and remove NaN entries             
weather.numeric()
weather.summarize()
--------------------------------------------------------------------------------
********************    Begin of the summary of text data   ********************
--------------------------------------------------------------------------------
count           1294
unique          1294
top       2008-10-30
freq               1
Name: Date, dtype: object


count     1294
unique      62
top           
freq       730
Name: CodeSum, dtype: object


count     1294
unique       1
top          M
freq      1294
Name: Water1, dtype: object


--------------------------------------------------------------------------------
********************    End of the summary of text data     ********************
--------------------------------------------------------------------------------
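The mean-replacement strategy mentioned above can be sketched with pandas alone (sklearn's imputer applies the same strategy with strategy='mean'). The frame below is invented, standing in for a slice of the weather data after numeric(clean=False), where coercion left NaN where the 'M' markers used to be.

```python
import numpy as np
import pandas as pd

# Illustrative columns after numeric(clean=False): NaN marks former 'M' entries
df = pd.DataFrame({"Depart": [14.0, np.nan, -3.0, np.nan],
                   "WetBulb": [63.0, 58.0, np.nan, 61.0]})

# Mean imputation: replace each NaN with its own column's mean
filled = df.fillna(df.mean())

print(filled["Depart"].tolist())  # [14.0, 5.5, -3.0, 5.5]
```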