Summary of Data
Get a rough idea of your new data!
Suppose we have a new data set that has been cleaned and organized in a standard format such as comma-separated values (CSV). Such a data set is likely to contain both text and numeric information. Text data can represent a categorical value such as the occupation of a customer (doctor, police officer, etc.). Numeric data, meanwhile, can show us the distribution of the continuous values assigned to a feature.
To understand the features of a new data set, I designed a function, summary(), which uses the Pandas and Matplotlib packages to plot histograms of the numeric data and describe the basic information of the text data.
summary() can also warn us of any missing data or entries marked as NaN (Not a Number) found in the imported CSV file, with the help of the isnull() function in Pandas. Below, I present the summary of the training set data (train.csv) used for the Kaggle competition "Predict West Nile virus in mosquitos across the city of Chicago".
In this way, besides getting a rough idea of the distribution of the numeric data, one can also learn the total number of unique values in the text data. We can also easily identify which entry appears most often for a given feature. The following example code demonstrates how to use a DataIn instance to summarize the imported data.
# Obtain the summary of train.csv
# train.csv file downloaded from https://www.kaggle.com/c/predict-west-nile-virus/data
%matplotlib inline
import sys
sys.path.append("../bin/")
from data import DataIn
if __name__ == "__main__":
    train = DataIn("train.csv")
    train.summarize()
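The internals of DataIn are not shown here. As a rough sketch only, the kind of summary described above (missing-value counts, statistics for numeric columns, and unique-value counts plus the most frequent entry for text columns) could be produced with plain Pandas along these lines. The function name summarize_frame, the inline sample data, and its column names are hypothetical and are not part of the original code.

```python
import io

import pandas as pd

# Hypothetical stand-in for a small CSV file; the columns are made up for illustration.
SAMPLE_CSV = """occupation,age,income
doctor,41,120000
police,35,
doctor,29,95000
teacher,,48000
"""


def summarize_frame(df):
    """Return missing-value counts, numeric statistics, and text-column summaries."""
    # Count NaN / missing entries per column.
    missing = df.isnull().sum()

    # Split the columns into numeric and non-numeric (text) features.
    numeric = df.select_dtypes(include="number")
    text = df.select_dtypes(exclude="number")

    # For each text column: number of unique values and the most frequent entry.
    text_info = {
        col: {
            "unique": text[col].nunique(),
            "most_frequent": text[col].mode().iloc[0],
        }
        for col in text.columns
    }
    return missing, numeric.describe(), text_info


df = pd.read_csv(io.StringIO(SAMPLE_CSV))
missing, numeric_stats, text_info = summarize_frame(df)

print(missing)
print(numeric_stats)
print(text_info)
```

Plotting is left out of the sketch to keep it short; calling df.hist() on the same frame would draw a histogram for each numeric column with Matplotlib.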