Summary of Data (Part 2)
Get basic ideas on the outliers/average of your data!
This post describes the revised functionalities offered by the summary
function introduced previously in Summary of Data. The functions are mainly available in the preprocess.py file. We use a simple data set, 'beijing_201802_201803_aq.csv', from the KDD CUP of Fresh Air, which provides air quality measurements recorded by weather stations in Beijing during February and March 2018.
It mainly presents two functionalities:
- warn_missing - Describes the proportion of missing values
- summary - Provides a descriptive summary of the data distribution. The visualization of the data distribution can exclude outliers (filter_outlier=True) specified by a set of quantiles.
To test the execution of the functions, please clone the repository jqlearning and work through the Jupyter notebook: Summary-of-data2.ipynb.
# Import functions and load data into a dataframe
import sys
sys.path.append("../")
import pandas as pd
from script.preprocess import summary, warn_missing
kwargs = {"parse_dates": ["utc_time"]}  # parse the utc_time column as datetime
bj_aq_df = pd.read_csv("beijing_201802_201803_aq.csv", **kwargs)
warn_missing(bj_aq_df, "beijing_201802_201803_aq.csv")  # report the proportion of missing values
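For readers who want to see the raw numbers directly, the proportion of missing values can also be reproduced with plain pandas. The snippet below is only a minimal sketch of the kind of report warn_missing produces, assuming a per-column proportion of NaN values; it is not the actual implementation in preprocess.py.
# Rough pandas equivalent of a missing-value report (sketch, not the actual warn_missing code)
missing_ratio = bj_aq_df.isna().mean().sort_values(ascending=False)
print(missing_ratio[missing_ratio > 0])  # columns containing missing values and their proportion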
The following summary considers the top 1% and the bottom 1% of the given numeric data as outliers. The summary
function gives us a simple idea of the type of each column and the distribution of the numeric data.
summary(bj_aq_df, quantile=(0.01, 0.99), outlier_as_nan=True, filter_outlier=True)
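Conceptually, filtering by quantiles means masking values that fall outside the chosen quantile range before describing the data. The snippet below is a minimal pandas sketch of that idea, assuming a column-wise 1%/99% cut; the actual logic lives in preprocess.py and may differ.
# Sketch of quantile-based outlier filtering with plain pandas (assumed, not the actual summary code)
numeric = bj_aq_df.select_dtypes(include="number")
low, high = numeric.quantile(0.01), numeric.quantile(0.99)
trimmed = numeric.where(numeric.ge(low) & numeric.le(high))  # values outside the quantile range become NaN
trimmed.describe()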
Users can also set outlier_as_nan=False to display the outliers of the data. This setting basically sets the non-outliers to NaN (not a number), so the summary focuses on the outlier values. This offers us a convenient way to understand the bigger picture of our available data before we proceed further to uncover additional insights!
summary(bj_aq_df, quantile=(0.01, 0.99), outlier_as_nan=False, filter_outlier=True)
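The opposite view can be sketched the same way: keep only the values flagged as outliers and mask everything else as NaN. Again, this is only an illustrative pandas sketch, not the code behind outlier_as_nan=False.
# Sketch: keep only the outliers, mask the non-outliers as NaN (assumed behaviour of outlier_as_nan=False)
numeric = bj_aq_df.select_dtypes(include="number")
low, high = numeric.quantile(0.01), numeric.quantile(0.99)
outliers_only = numeric.where(numeric.lt(low) | numeric.gt(high))
outliers_only.describe()  # summary of the outlier values alone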