Visualization of Bus Bunching

Bokeh plot based on the spatio-temporal info of New York city bus data.

Visualization of Bus Bunching

When we are working on spatio-temporal data sets, it will be handy if we can visualize the spatial components of data while understanding their relations with time series. In this post, I present an example of how to visualize the bus bunching (buses of the same service number arriving at the same stop) of New York city.

The original data set are available at New York City Bus Data. In addition, we download the bus stop information of New York City

This jupyter notebook is available at my Github page: VisualizeBusBunching.ipynb, and it is part of the repository jqlearning.

Preprocessing Data

  • Before loading the city bus data of 2017-06, an additional preprocessing step is required. The file 'mta_1706.csv' has additional ',' that prevents us to load the correct number of columns with pd.read_csv(). For instance, if you refer to the line 53292, you will notice that information such as 'CPW/110st (non-public, for GEO)'. The comma that follows the word non-public causes the loading error. We can thus remove this additional comma by Regular Expressions.
  • The data set of 2017-06 has about 1 Gb size. For simplicity, we focus on the records of 2017-06-01, and filter any row record having missing values.
  • Some vehicle recods could have duplicate records due to different scheduled arrival time values. We should filter the duplicate records by choosing only one of the vehicle records.
In [1]:
import pandas as pd
import numpy as np
In [2]:
df = pd.read_csv("mta_1706.csv")
In [3]:
# Set to datetime object
df['RecordedAtTime'] = pd.to_datetime(df['RecordedAtTime'])
In [4]:
df = df[(df['RecordedAtTime'] < pd.Timestamp('2017-06-02')) & (df['RecordedAtTime'] > pd.Timestamp('2017-05-31'))]
In [5]:
# filter missing values
df = df.dropna(axis=0, how='any')
In [6]:
# BusCoord records both Longitude and Latidue info
df['BusCoord'] = list(zip(df["VehicleLocation.Longitude"], df["VehicleLocation.Latitude"]))
In [7]:
df.head(5)
Out[7]:
RecordedAtTime DirectionRef PublishedLineName OriginName OriginLat OriginLong DestinationName DestinationLat DestinationLong VehicleRef VehicleLocation.Latitude VehicleLocation.Longitude NextStopPointName ArrivalProximityText DistanceFromStop ExpectedArrivalTime ScheduledArrivalTime BusCoord
0 2017-06-01 00:03:34 0 B8 4 AV/95 ST 40.616104 -74.031143 BROWNSVILLE ROCKAWAY AV 40.656048 -73.907379 NYCT_430 40.635170 -73.960803 FOSTER AV/E 18 ST approaching 76.0 2017-06-01 00:03:59 24:06:14 (-73.960803, 40.63517)
1 2017-06-01 00:03:43 1 S61 ST GEORGE FERRY/S61 & S91 40.643169 -74.073494 S I MALL YUKON AV 40.575935 -74.167686 NYCT_8263 40.590802 -74.158340 MERRYMOUNT ST/TRAVIS AV approaching 62.0 2017-06-01 00:03:56 23:58:02 (-74.15834, 40.590802000000004)
2 2017-06-01 00:03:49 0 Bx10 E 206 ST/BAINBRIDGE AV 40.875008 -73.880142 RIVERDALE 263 ST 40.912376 -73.902534 NYCT_4223 40.886010 -73.912647 HENRY HUDSON PKY E/W 235 ST at stop 5.0 2017-06-01 00:03:56 24:00:53 (-73.91264699999999, 40.88601)
3 2017-06-01 00:03:31 0 Q5 TEARDROP/LAYOVER 40.701748 -73.802399 ROSEDALE LIRR STA via MERRICK 40.666012 -73.735939 NYCT_8422 40.668002 -73.729348 HOOK CREEK BL/SUNRISE HY < 1 stop away 267.0 2017-06-01 00:04:03 24:03:00 (-73.729348, 40.668002)
4 2017-06-01 00:03:22 1 Bx1 RIVERDALE AV/W 231 ST 40.881187 -73.909340 MOTT HAVEN 136 ST via CONCOURSE 40.809654 -73.928360 NYCT_4710 40.868134 -73.893032 GRAND CONCOURSE/E 196 ST at stop 11.0 2017-06-01 00:03:56 23:59:38 (-73.89303199999999, 40.868134000000005)
In [8]:
vehicle_gb = df.groupby(["RecordedAtTime", "PublishedLineName", "DirectionRef", "VehicleRef"])
In [9]:
vehicle_cnt_df = vehicle_gb.count()
In [10]:
# The following demonstrates an example of duplicate record
vehicle_gb.get_group(vehicle_cnt_df[vehicle_cnt_df["BusCoord"] > 1].index[0])
Out[10]:
RecordedAtTime DirectionRef PublishedLineName OriginName OriginLat OriginLong DestinationName DestinationLat DestinationLong VehicleRef VehicleLocation.Latitude VehicleLocation.Longitude NextStopPointName ArrivalProximityText DistanceFromStop ExpectedArrivalTime ScheduledArrivalTime BusCoord
181 2017-06-01 00:03:28 1 S46 ST GEORGE FERRY/S46 & S96 40.643429 -74.073654 W SHORE PLZ via CASTLETON 40.601971 -74.19133 NYCT_8188 40.637748 -74.076454 VICTORY BL/MONTGOMERY AV approaching 112.0 2017-06-01 00:04:03 00:02:22 (-74.076454, 40.637747999999995)
182 2017-06-01 00:03:28 1 S46 ST GEORGE FERRY/S46 & S96 40.643429 -74.073654 W SHORE PLZ via CASTLETON 40.601971 -74.19133 NYCT_8188 40.637748 -74.076454 VICTORY BL/MONTGOMERY AV approaching 112.0 2017-06-01 00:04:03 00:02:26 (-74.076454, 40.637747999999995)
In [11]:
bus_df = vehicle_gb.head(1).copy()

New York City Bus Stop Info

In [12]:
stop_bronx_df = pd.read_csv("stops_bronx.txt")
stop_brooklyn_df = pd.read_csv("stops_brooklyn.txt")
stop_manhattan_df = pd.read_csv("stops_manhattan.txt")
stop_queens_df = pd.read_csv("stops_queens.txt")
stop_staten_island_df = pd.read_csv("stops_staten_island.txt")
In [13]:
stop_new_york_df = pd.concat([stop_bronx_df, 
                              stop_brooklyn_df,
                             stop_manhattan_df,
                             stop_queens_df,
                             stop_staten_island_df], axis=0)
In [14]:
stop_new_york_df.drop_duplicates(inplace=True)

Determine the Bus Bunching Based on the NextStopPointName

In [15]:
bus_at_stop_df =  bus_df[bus_df["ArrivalProximityText"] == "at stop"]
In [16]:
# For bus bunching, buses of the same service number arriving at the same stop
bus_at_stop_gb = bus_at_stop_df.groupby(["RecordedAtTime", "PublishedLineName", "DirectionRef", "NextStopPointName"])
bus_at_stop_cnt = bus_at_stop_gb.count()
In [17]:
# Bunched buses have multiple locations
bunched_bus_df = bus_at_stop_cnt[bus_at_stop_cnt["BusCoord"] > 1]
In [18]:
# An example of a bus bunchin scenario
bus_at_stop_gb.get_group(bunched_bus_df.index[0])
Out[18]:
RecordedAtTime DirectionRef PublishedLineName OriginName OriginLat OriginLong DestinationName DestinationLat DestinationLong VehicleRef VehicleLocation.Latitude VehicleLocation.Longitude NextStopPointName ArrivalProximityText DistanceFromStop ExpectedArrivalTime ScheduledArrivalTime BusCoord
16857 2017-06-01 06:13:45 0 Q58 PALMETTO ST/MYRTLE AV 40.700176 -73.910255 LTD FLUSHING MAIN ST 40.757343 -73.829361 NYCT_4520 40.757344 -73.829513 41 RD/MAIN ST at stop 10.0 2017-06-01 06:13:50 06:02:00 (-73.829513, 40.757344)
17113 2017-06-01 06:13:45 0 Q58 PALMETTO ST/MYRTLE AV 40.700176 -73.910255 FLUSHING MAIN ST 40.757343 -73.829361 NYCT_6563 40.757285 -73.829638 41 RD/MAIN ST at stop 23.0 2017-06-01 06:13:50 06:08:00 (-73.829638, 40.757284999999996)

Prepare Data for Visualization

  • Assign bus bunching status for all the bus records.
  • Group the RecordedAtTime into 30 seconds interval.
  • Convert the latitude, longitude information since Bokeh employs Web Mercator projection for mapping. Credit goes to Charlie Harper's post on Visualizing Data with Bokeh and Pandas, where he detailed a helper function for the conversion using Pyproj.
In [19]:
# Initilialize as False
bus_df["BunchedStatus"] = False
bus_df["BunchedStatus"] = bus_df.apply(lambda x: True if (x["RecordedAtTime"], 
                                                      x["PublishedLineName"], 
                                                      x["DirectionRef"],
                                                      x["NextStopPointName"]) in bunched_bus_df.index else False, axis=1)
In [20]:
bus_df['TimeInterval'] = bus_df['RecordedAtTime'].map(lambda x: x.floor('30s'))
In [21]:
from pyproj import Proj, transform 

# helper function to convert lat/long to easting/northing for mapping
def LongLat_to_EN(long, lat):
    try:
        easting, northing = transform(Proj(init='epsg:4326'), Proj(init='epsg:3857'), long, lat)
        return easting, northing
    except:
        return None, None
In [22]:
bus_df['VehicleLocation.E'], bus_df['VehicleLocation.N'] = zip(*bus_df.apply(
    lambda x: LongLat_to_EN(x['VehicleLocation.Longitude'], x['VehicleLocation.Latitude']), axis=1))
In [23]:
stop_new_york_df["stop.E"], stop_new_york_df["stop.N"] = zip(*stop_new_york_df.apply(
    lambda x: LongLat_to_EN(x["stop_lon"], x["stop_lat"]), axis=1))

Visualization with Bokeh

In [24]:
from bokeh.plotting import figure, show, ColumnDataSource
from bokeh.tile_providers import CARTODBPOSITRON
from bokeh.io import output_notebook, push_notebook
from bokeh.models import HoverTool

output_notebook()
Loading BokehJS ...
In [25]:
def busSources(datetime):
    # given a datetime, separate bunched bus and non-bunched bus sources.
    
    # Selected data sourceNonBunched for bunched bus
    sourceBunched = ColumnDataSource(data=dict(
        lon=bus_df[(bus_df['BunchedStatus'] == True) & (bus_df['TimeInterval'] == datetime)]["VehicleLocation.E"],
        lat=bus_df[(bus_df['BunchedStatus'] == True) & (bus_df['TimeInterval'] == datetime)]["VehicleLocation.N"],
        PublishedLineName=bus_df[(bus_df['BunchedStatus'] == True) & (bus_df['TimeInterval'] == datetime)]["PublishedLineName"],
        DirectionRef=bus_df[(bus_df['BunchedStatus'] == True) & (bus_df['TimeInterval'] == datetime)]["DirectionRef"],
        VehicleRef=bus_df[(bus_df['BunchedStatus'] == True) & (bus_df['TimeInterval'] == datetime)]["VehicleRef"],
        RecordedAtTime=bus_df[(bus_df['BunchedStatus'] == True) & (bus_df['TimeInterval'] == datetime)]["RecordedAtTime"],
        NextStopPoint=bus_df[(bus_df['BunchedStatus'] == True) & (bus_df['TimeInterval'] == datetime)]["NextStopPointName"]
    ))
    
    # Selected data sourceNonBunchedfor non-bunching bus
    sourceNonBunched = ColumnDataSource(data=dict(
        lon=bus_df[(bus_df['BunchedStatus'] == False) & (bus_df['TimeInterval'] == datetime)]["VehicleLocation.E"],
        lat=bus_df[(bus_df['BunchedStatus'] == False) & (bus_df['TimeInterval'] == datetime)]["VehicleLocation.N"],
        PublishedLineName=bus_df[(bus_df['BunchedStatus'] == False) & (bus_df['TimeInterval'] == datetime)]["PublishedLineName"],
        DirectionRef=bus_df[(bus_df['BunchedStatus'] == False) & (bus_df['TimeInterval'] == datetime)]["DirectionRef"],
        VehicleRef=bus_df[(bus_df['BunchedStatus'] == False) & (bus_df['TimeInterval'] == datetime)]["VehicleRef"],
        RecordedAtTime=bus_df[(bus_df['BunchedStatus'] == False) & (bus_df['TimeInterval'] == datetime)]["RecordedAtTime"],
        NextStopPoint=bus_df[(bus_df['BunchedStatus'] == False) & (bus_df['TimeInterval'] == datetime)]["NextStopPointName"]        
    ))
    return sourceBunched, sourceNonBunched
In [26]:
def visualize_selected_time(datetime):
    plot = figure(x_range=(bus_df["VehicleLocation.E"].min(), bus_df["VehicleLocation.E"].max()), 
               y_range=(bus_df["VehicleLocation.N"].min(), bus_df["VehicleLocation.N"].max()),
               x_axis_type="mercator", y_axis_type="mercator")
    plot.add_tile(CARTODBPOSITRON)
    sourceBunched, sourceNonBunched = busSources(datetime)
            
    # Bus Stop Info
    bus_stops = ColumnDataSource(data=dict(
        lon=stop_new_york_df["stop.E"],
        lat=stop_new_york_df["stop.N"],
        Name=stop_new_york_df["stop_name"]))

    circle1 = plot.circle('lon', 'lat', size=2.5, color="orange", alpha=0.3, source=bus_stops)
    # add a circle renderer with a size, color, and alpha
    circle2 = plot.circle('lon', 'lat', size=5, color="navy", alpha=0.5, source=sourceNonBunched)
    circle3 = plot.circle('lon', 'lat', size=8, color="red", alpha=0.5, source=sourceBunched)
    
    plot.add_tools(HoverTool(renderers=[circle2, circle3], tooltips=[("BusService", "@PublishedLineName"), 
                                    ("Direction", "@DirectionRef"),
                                    ("Vehicle", "@VehicleRef"),
                                    # The timestamp will be automatically converted to epoch time by default
                                    # https://bokeh.pydata.org/en/latest/docs/reference/models/formatters.html#bokeh.models.formatters.NumeralTickFormatter.format
                                    ("Time", "@RecordedAtTime{%F %T}"),
                                    ("NextStopPoint", "@NextStopPoint")],
                                    formatters={"RecordedAtTime": "datetime"}))
    
    plot.add_tools(HoverTool(renderers=[circle1], tooltips=[
                                    ("Stop Name", "@Name")]))        
    
    return plot, sourceBunched, sourceNonBunched
In [27]:
plot, sourceBunched, sourceNonBunched = visualize_selected_time(pd.Timestamp("20170601 12:44:00"))
  • The following plot visualizes the bus bunching instance at period around 12:44:00. The bunched bus instances are highlighted by red, whereas the other bus records are makred by blue. There are thousands of bus stops available at the New York city, and they are shown as orange circles in this plot
In [28]:
show(plot, notebook_handle=True)
Out[28]:

<Bokeh Notebook handle for In[28]>

Add An Interactive Dropdown Menu

  • Finally, we can use ipywidgets to generate a dropdown menu, which allows us to visualize the potential bus bunching at any selected time interval.
  • Note that this interactive menu requires a functioning local server. You may download this notebook and play with the visualization using the following dropdown menu.
In [29]:
TimeIntervalStr = bus_df["TimeInterval"].astype(str)
UniqueTimeIntervalStr = TimeIntervalStr.unique()
UniqueTimeIntervalStr = sorted(UniqueTimeIntervalStr)
In [30]:
def update_plot(datetime="2017-06-01 12:44:00"):
    timestamp = pd.Timestamp(datetime)
    newBunched, newNonBunched = busSources(timestamp)
    sourceBunched.data = newBunched.data
    sourceNonBunched.data = newNonBunched.data
    push_notebook()
In [31]:
from ipywidgets import interact

interact_panel = interact(update_plot, datetime=UniqueTimeIntervalStr)