Scatter Plot for Time Series Data in Groups
Understand your data with respect to their group labels.
During exploratory data analysis, we often want to investigate how the data behave within each label (group). The grp_ts_scatter
function from preprocess.py offers the following functionality:
- Scatter plot of your data with respect to the group labels of a chosen feature.
- Mark the missing values with red markers.
In addition, I wrote a helper function, nan_zeroes,
to show how the missing values are distributed along the x-axis.
This Jupyter notebook is available on my GitHub page: Scatter-Plot-for-Time-Series-Data-in-Groups.ipynb, and it is part of the jqlearning repository.
# Import functions and load data into a dataframe
import sys
sys.path.append("../")
import pandas as pd
from script.preprocess import grp_ts_scatter, nan_zeroes
kwargs = {"parse_dates": ["utc_time"]}
bj_aq_df = pd.read_csv("beijing_201802_201803_aq.csv", **kwargs)
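As a quick sanity check on the parse_dates option, the following sketch reads a tiny in-memory CSV and confirms that utc_time comes back as a datetime column. The column names follow the notebook; the rows are synthetic and only stand in for the real file.

```python
import io

import pandas as pd

# Synthetic stand-in for the CSV; the values are made up for illustration.
csv_text = """stationId,utc_time,O3
aotizhongxin_aq,2018-02-01 00:00:00,40.0
aotizhongxin_aq,2018-02-01 01:00:00,
zhiwuyuan_aq,2018-02-01 00:00:00,
"""

sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["utc_time"])

# utc_time should now be a datetime64 column rather than plain strings.
is_datetime = pd.api.types.is_datetime64_any_dtype(sample_df["utc_time"])
print(sample_df.dtypes)
print("utc_time parsed as datetime:", is_datetime)
```

Without parse_dates the column would be read as strings, and a time axis built from it would not be ordered or spaced as dates.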
The following scatter plots show only the available (non-missing) data.
They reveal that the station with stationId zhiwuyuan_aq does not have any data.
grp_ts_scatter(bj_aq_df, "utc_time", "O3",
               grp="stationId", col_wrap=5, markersize=2, display_nan=False)
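An empty panel can also be confirmed numerically. The sketch below counts the non-missing O3 readings per station on a small synthetic frame; the column names match the notebook, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: one station with some readings, one with none at all.
times = ["2018-02-01 00:00", "2018-02-01 01:00", "2018-02-01 02:00"]
sample_df = pd.DataFrame({
    "stationId": ["aotizhongxin_aq"] * 3 + ["zhiwuyuan_aq"] * 3,
    "utc_time": pd.to_datetime(times * 2),
    "O3": [40.0, np.nan, 35.0, np.nan, np.nan, np.nan],
})

# count() ignores NaN, so a station without any readings gets a count of 0.
valid_counts = sample_df.groupby("stationId")["O3"].count()
empty_stations = valid_counts[valid_counts == 0].index.tolist()

print(valid_counts)
print("Stations without data:", empty_stations)
```

Running the same two lines on bj_aq_df should list every station whose panel is empty.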
With the option display_nan=True, the following scatter plots also show the location of missing values (marked in red at zero).
This reveals how the missing values are distributed across the groups.
grp_ts_scatter(bj_aq_df, "utc_time", "O3",
               grp="stationId", col_wrap=5, markersize=4, display_nan=True)
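For readers without the repository at hand, here is a rough approximation of this kind of plot using matplotlib directly. It is not the implementation in preprocess.py: the function name grouped_scatter_sketch is hypothetical, its parameters merely mirror the call above, and the data are synthetic.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def grouped_scatter_sketch(df, x, y, grp, col_wrap=5, markersize=2,
                           display_nan=False):
    """Hypothetical stand-in for grp_ts_scatter: one scatter panel per group."""
    groups = list(df.groupby(grp))
    ncols = max(1, min(col_wrap, len(groups)))
    nrows = max(1, int(np.ceil(len(groups) / ncols)))
    fig, axes = plt.subplots(nrows, ncols, sharex=True, sharey=True,
                             figsize=(3 * ncols, 2.5 * nrows), squeeze=False)
    panels = axes.ravel()
    for ax, (name, sub) in zip(panels, groups):
        # Observed values; NaN entries are simply not drawn.
        ax.plot(sub[x], sub[y], ".", markersize=markersize)
        if display_nan:
            # Draw each missing observation at y = 0 in red.
            missing = sub[sub[y].isna()]
            ax.plot(missing[x], np.zeros(len(missing)), ".",
                    color="red", markersize=markersize)
        ax.set_title(str(name))
    # Hide any unused panels in the grid.
    for ax in panels[len(groups):]:
        ax.set_visible(False)
    return fig, axes


# Synthetic data, for illustration only.
times = ["2018-02-01 00:00", "2018-02-01 01:00", "2018-02-01 02:00"]
sample_df = pd.DataFrame({
    "stationId": ["aotizhongxin_aq"] * 3 + ["zhiwuyuan_aq"] * 3,
    "utc_time": pd.to_datetime(times * 2),
    "O3": [40.0, np.nan, 35.0, np.nan, np.nan, np.nan],
})
fig, axes = grouped_scatter_sketch(sample_df, "utc_time", "O3",
                                   grp="stationId", display_nan=True)
```

In this sketch a station with no readings shows only red markers along zero, which is the pattern to look for in the real plots.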
If you would like to investigate further by zooming in on individual data points, consider an interactive library such as Bokeh.
# Bokeh plot
from bokeh.io import output_notebook
from bokeh.plotting import figure, show
from bokeh.models import HoverTool
output_notebook()
# nan_zeroes will add a column named "nan_O3", which marks the missing values at zeroes
bj_aq_df = nan_zeroes(bj_aq_df, 'O3')
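To make the marker column concrete, here is a minimal sketch of what such a helper could look like, based only on the behavior described in the comment above. The name nan_zeroes_sketch is hypothetical, and the actual nan_zeroes in preprocess.py may be implemented differently.

```python
import numpy as np
import pandas as pd


def nan_zeroes_sketch(df, col):
    """Hypothetical stand-in for nan_zeroes: add a 'nan_<col>' marker column."""
    out = df.copy()
    # 0.0 where the value is missing, NaN elsewhere, so only the gaps get plotted.
    out["nan_" + col] = np.where(out[col].isna(), 0.0, np.nan)
    return out


# Synthetic data, for illustration only.
sample_df = pd.DataFrame({"O3": [40.0, np.nan, 35.0]})
marked_df = nan_zeroes_sketch(sample_df, "O3")
print(marked_df)
```

Because the marker column is NaN wherever a real reading exists, plotting it alongside the original column draws points only at the gaps.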
The following interactive plot allows you to hover over the data points and inspect their values.
p = figure(x_axis_type='datetime')
# The x-axis values are converted to Unix-epoch time automatically
# The following @x{%F %T} converts them back to a readable datetime format
# Note: recent Bokeh releases expect the "@x" key in formatters; older ones used "x"
p.add_tools(HoverTool(tooltips=[("datetime", "@x{%F %T}"),
                                ("O3", "@y")],
                      formatters={"@x": "datetime"}))
# Select a single station once and reuse it for both renderers
station_df = bj_aq_df[bj_aq_df['stationId'] == 'aotizhongxin_aq']
# Add circle renderers with a size, color, and alpha:
# observed values in navy, missing values (at zero) in red
p.circle(station_df['utc_time'], station_df['O3'],
         size=3, color="navy", alpha=0.5)
p.circle(station_df['utc_time'], station_df['nan_O3'],
         size=3, color="red", alpha=0.5)
show(p)
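The epoch conversion mentioned in the code comments can be reproduced with pandas alone. The sketch below assumes the values are milliseconds since the Unix epoch, which is how Bokeh represents datetimes, and uses an arbitrary example timestamp.

```python
import pandas as pd

ts = pd.Timestamp("2018-02-01 00:00:00")

# Timestamp.value is nanoseconds since the Unix epoch; convert to milliseconds.
epoch_ms = ts.value // 10**6

# Convert back to a timestamp and format it like the hover tooltip.
# "%Y-%m-%d %H:%M:%S" is the portable spelling of "%F %T".
restored = pd.to_datetime(epoch_ms, unit="ms")
label = restored.strftime("%Y-%m-%d %H:%M:%S")

print(epoch_ms, "->", label)
```

If the hover tooltip shows a large raw number instead of a date, the formatter is not being applied to that field.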