Visualization of Bus Bunching¶

When working with spatio-temporal data sets, it is handy to visualize the spatial components of the data while understanding how they relate to the time series. In this post, I present an example of how to visualize bus bunching (buses of the same service number arriving at the same stop at the same time) in New York City.

The original data set is available at New York City Bus Data. In addition, we download the bus stop information for New York City.

This Jupyter notebook is available on my GitHub page: VisualizeBusBunching.ipynb, and it is part of the repository jqlearning.

Preprocessing Data¶

Before loading the city bus data of 2017-06, an additional preprocessing step is required. The file 'mta_1706.csv' contains extra ',' characters that prevent pd.read_csv() from loading the correct number of columns. For instance, line 53292 contains the field 'CPW/110st (non-public, for GEO)'. The comma that follows the word non-public causes the loading error. We can remove this extra comma with regular expressions.
The 2017-06 data set is about 1 GB in size. For simplicity, we focus on the records of 2017-06-01 and drop any row with missing values.
Some vehicle records are duplicated because they differ only in their scheduled arrival time. We filter the duplicates by keeping only one of the records for each vehicle.
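The regular-expression cleanup described above is not shown in the notebook, so here is a minimal sketch of one way it could be done. It assumes the stray commas only occur inside parentheses, as in the 'CPW/110st (non-public, for GEO)' example; the helper names strip_commas_in_parens and clean_csv, and any output path passed to clean_csv, are hypothetical.

```python
import re

# Assumption: stray commas only occur inside a parenthesised group,
# as in 'CPW/110st (non-public, for GEO)'.
PAREN_GROUP = re.compile(r"\([^()]*\)")


def strip_commas_in_parens(line):
    """Remove commas that appear inside parentheses; leave field separators alone."""
    return PAREN_GROUP.sub(lambda m: m.group(0).replace(",", ""), line)


def clean_csv(src_path, dst_path):
    """Write a cleaned copy of src_path to dst_path (both paths are hypothetical)."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(strip_commas_in_parens(line))


print(strip_commas_in_parens("B8,CPW/110st (non-public, for GEO),40.7"))
# B8,CPW/110st (non-public for GEO),40.7
```

After running clean_csv on the raw file, the cleaned copy would be the one passed to pd.read_csv().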

import pandas as pd
import numpy as np

df = pd.read_csv("mta_1706.csv")

# Set to datetime object
df['RecordedAtTime'] = pd.to_datetime(df['RecordedAtTime'])

# Keep only the records of 2017-06-01
df = df[(df['RecordedAtTime'] >= pd.Timestamp('2017-06-01')) & (df['RecordedAtTime'] < pd.Timestamp('2017-06-02'))]

# filter missing values
df = df.dropna(axis=0, how='any')

# BusCoord records both longitude and latitude info
df['BusCoord'] = list(zip(df["VehicleLocation.Longitude"], df["VehicleLocation.Latitude"]))

df.head(5)

vehicle_gb = df.groupby(["RecordedAtTime", "PublishedLineName", "DirectionRef", "VehicleRef"])

vehicle_cnt_df = vehicle_gb.count()

# The following shows an example of a duplicate record
vehicle_gb.get_group(vehicle_cnt_df[vehicle_cnt_df["BusCoord"] > 1].index[0])

bus_df = vehicle_gb.head(1).copy()
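The groupby(...).head(1) step above keeps the first record for each (time, line, direction, vehicle) key. As a hedged illustration on toy data, the same rows can be obtained with drop_duplicates on those key columns; the toy frame below reuses the notebook's column names and a few values from the sample output, and is not the real data set.

```python
import pandas as pd

keys = ["RecordedAtTime", "PublishedLineName", "DirectionRef", "VehicleRef"]

# Toy frame: the first two rows differ only in ScheduledArrivalTime
toy = pd.DataFrame({
    "RecordedAtTime": pd.to_datetime(["2017-06-01 00:03:28",
                                      "2017-06-01 00:03:28",
                                      "2017-06-01 00:03:34"]),
    "PublishedLineName": ["S46", "S46", "B8"],
    "DirectionRef": [1, 1, 0],
    "VehicleRef": ["NYCT_8188", "NYCT_8188", "NYCT_430"],
    "ScheduledArrivalTime": ["00:02:22", "00:02:26", "24:06:14"],
})

# Keep the first record for each key combination
deduped = toy.drop_duplicates(subset=keys, keep="first")

# Compare with the groupby-based approach used in the notebook
same_as_groupby = deduped.equals(toy.groupby(keys).head(1))
print(len(deduped), same_as_groupby)
```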

New York City Bus Stop Info¶

stop_bronx_df = pd.read_csv("stops_bronx.txt")
stop_brooklyn_df = pd.read_csv("stops_brooklyn.txt")
stop_manhattan_df = pd.read_csv("stops_manhattan.txt")
stop_queens_df = pd.read_csv("stops_queens.txt")
stop_staten_island_df = pd.read_csv("stops_staten_island.txt")

stop_new_york_df = pd.concat([stop_bronx_df,
                              stop_brooklyn_df,
                              stop_manhattan_df,
                              stop_queens_df,
                              stop_staten_island_df], axis=0)

stop_new_york_df.drop_duplicates(inplace=True)

Determine the Bus Bunching Based on the NextStopPointName¶

bus_at_stop_df = bus_df[bus_df["ArrivalProximityText"] == "at stop"]

# Bus bunching: buses of the same service number arriving at the same stop at the same time
bus_at_stop_gb = bus_at_stop_df.groupby(["RecordedAtTime", "PublishedLineName", "DirectionRef", "NextStopPointName"])
bus_at_stop_cnt = bus_at_stop_gb.count()

# Bunched buses have multiple locations
bunched_bus_df = bus_at_stop_cnt[bus_at_stop_cnt["BusCoord"] > 1]

# An example of a bus bunching scenario
bus_at_stop_gb.get_group(bunched_bus_df.index[0])

Prepare Data for Visualization¶

Assign a bus bunching status to every bus record.
Group RecordedAtTime into 30-second intervals.
Convert the latitude and longitude information to Web Mercator coordinates, since Bokeh uses the Web Mercator projection for mapping. Credit goes to Charlie Harper's post on Visualizing Data with Bokeh and Pandas, which details a helper function for the conversion using Pyproj.

# Initialize as False
bus_df["BunchedStatus"] = False
# Flag every record whose (time, line, direction, next stop) key belongs to a bunched group
bus_df["BunchedStatus"] = bus_df.apply(lambda x: (x["RecordedAtTime"],
                                                  x["PublishedLineName"],
                                                  x["DirectionRef"],
                                                  x["NextStopPointName"]) in bunched_bus_df.index, axis=1)
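The row-wise apply above runs a Python function for every record, which can be slow on a full day of data. A minimal vectorized sketch, assuming the same key columns, collects the bunched keys with groupby(...).size() and tests membership with MultiIndex.isin. The toy rows (including the vehicle NYCT_7000) and the names records and bunched_keys are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

keys = ["RecordedAtTime", "PublishedLineName", "DirectionRef", "NextStopPointName"]

# Toy records (illustrative only) using the notebook's column names
records = pd.DataFrame({
    "RecordedAtTime": pd.to_datetime(["2017-06-01 06:13:45"] * 3
                                     + ["2017-06-01 06:14:10"]),
    "PublishedLineName": ["Q58", "Q58", "Q58", "B8"],
    "DirectionRef": [0, 0, 0, 0],
    "NextStopPointName": ["41 RD/MAIN ST"] * 3 + ["FOSTER AV/E 18 ST"],
    "VehicleRef": ["NYCT_4520", "NYCT_6563", "NYCT_7000", "NYCT_430"],
    "ArrivalProximityText": ["at stop", "at stop", "approaching", "at stop"],
})

# Keys shared by more than one "at stop" record
at_stop = records[records["ArrivalProximityText"] == "at stop"]
group_sizes = at_stop.groupby(keys).size()
bunched_keys = group_sizes[group_sizes > 1].index

# Flag every record whose key is one of the bunched keys
records["BunchedStatus"] = pd.MultiIndex.from_frame(records[keys]).isin(bunched_keys)
print(records[["VehicleRef", "BunchedStatus"]])
```

As with the apply version, a record that is only approaching the stop is also flagged when it shares its key with a bunched group.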

bus_df['TimeInterval'] = bus_df['RecordedAtTime'].dt.floor('30s')

from pyproj import Proj, transform 

# helper function to convert lat/long to easting/northing for mapping
def LongLat_to_EN(long, lat):
    try:
        easting, northing = transform(Proj(init='epsg:4326'), Proj(init='epsg:3857'), long, lat)
        return easting, northing
    except Exception:
        return None, None

bus_df['VehicleLocation.E'], bus_df['VehicleLocation.N'] = zip(*bus_df.apply(
    lambda x: LongLat_to_EN(x['VehicleLocation.Longitude'], x['VehicleLocation.Latitude']), axis=1))

stop_new_york_df["stop.E"], stop_new_york_df["stop.N"] = zip(*stop_new_york_df.apply(
    lambda x: LongLat_to_EN(x["stop_lon"], x["stop_lat"]), axis=1))
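For reference, the conversion performed by the LongLat_to_EN helper above (EPSG:4326 to EPSG:3857) has a simple closed form, the spherical Mercator projection. The sketch below implements it with the standard library only, which can serve as a dependency-free cross-check; the function name lon_lat_to_web_mercator is an assumption for illustration, and the sample coordinates are taken from the bunching example in the data.

```python
import math

# Sphere radius used by EPSG:3857 (the WGS84 semi-major axis), in metres
EARTH_RADIUS_M = 6378137.0


def lon_lat_to_web_mercator(lon_deg, lat_deg):
    """Convert longitude/latitude in degrees to Web Mercator easting/northing in metres."""
    easting = EARTH_RADIUS_M * math.radians(lon_deg)
    northing = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return easting, northing


# A bus position near Flushing Main St, taken from the sample records
easting, northing = lon_lat_to_web_mercator(-73.829513, 40.757344)
print(easting, northing)
```

The formula is undefined at the poles, which is not a concern for coordinates within New York City.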

Visualization with Bokeh¶

from bokeh.plotting import figure, show, ColumnDataSource
from bokeh.tile_providers import CARTODBPOSITRON
from bokeh.io import output_notebook, push_notebook
from bokeh.models import HoverTool

output_notebook()

def busSources(datetime):
    # Given a time interval, build separate data sources for bunched and non-bunched buses.

    # Map each data source field to the corresponding bus_df column
    columns = {
        "lon": "VehicleLocation.E",
        "lat": "VehicleLocation.N",
        "PublishedLineName": "PublishedLineName",
        "DirectionRef": "DirectionRef",
        "VehicleRef": "VehicleRef",
        "RecordedAtTime": "RecordedAtTime",
        "NextStopPoint": "NextStopPointName",
    }

    def make_source(bunched):
        # Select the records of this interval with the requested bunching status
        selected = bus_df[(bus_df['BunchedStatus'] == bunched) & (bus_df['TimeInterval'] == datetime)]
        return ColumnDataSource(data={field: selected[column] for field, column in columns.items()})

    sourceBunched = make_source(True)
    sourceNonBunched = make_source(False)
    return sourceBunched, sourceNonBunched

def visualize_selected_time(datetime):
    plot = figure(x_range=(bus_df["VehicleLocation.E"].min(), bus_df["VehicleLocation.E"].max()), 
               y_range=(bus_df["VehicleLocation.N"].min(), bus_df["VehicleLocation.N"].max()),
               x_axis_type="mercator", y_axis_type="mercator")
    plot.add_tile(CARTODBPOSITRON)
    sourceBunched, sourceNonBunched = busSources(datetime)
            
    # Bus Stop Info
    bus_stops = ColumnDataSource(data=dict(
        lon=stop_new_york_df["stop.E"],
        lat=stop_new_york_df["stop.N"],
        Name=stop_new_york_df["stop_name"]))

    circle1 = plot.circle('lon', 'lat', size=2.5, color="orange", alpha=0.3, source=bus_stops)
    # add a circle renderer with a size, color, and alpha
    circle2 = plot.circle('lon', 'lat', size=5, color="navy", alpha=0.5, source=sourceNonBunched)
    circle3 = plot.circle('lon', 'lat', size=8, color="red", alpha=0.5, source=sourceBunched)
    
    plot.add_tools(HoverTool(renderers=[circle2, circle3], tooltips=[("BusService", "@PublishedLineName"), 
                                    ("Direction", "@DirectionRef"),
                                    ("Vehicle", "@VehicleRef"),
                                    # The timestamp will be automatically converted to epoch time by default
                                    # https://bokeh.pydata.org/en/latest/docs/reference/models/formatters.html#bokeh.models.formatters.NumeralTickFormatter.format
                                    ("Time", "@RecordedAtTime{%F %T}"),
                                    ("NextStopPoint", "@NextStopPoint")],
                                    formatters={"RecordedAtTime": "datetime"}))
    
    plot.add_tools(HoverTool(renderers=[circle1], tooltips=[
                                    ("Stop Name", "@Name")]))        
    
    return plot, sourceBunched, sourceNonBunched

plot, sourceBunched, sourceNonBunched = visualize_selected_time(pd.Timestamp("20170601 12:44:00"))

The following plot visualizes the bus bunching instances around 12:44:00. Bunched buses are highlighted in red, whereas the other bus records are marked in blue. There are thousands of bus stops in New York City, and they are shown as orange circles in this plot.

show(plot, notebook_handle=True)

Finally, we can use ipywidgets to generate a dropdown menu, which allows us to visualize the potential bus bunching at any selected time interval.
Note that this interactive menu requires a functioning local server. You may download this notebook and play with the visualization using the following dropdown menu.

TimeIntervalStr = bus_df["TimeInterval"].astype(str)
UniqueTimeIntervalStr = TimeIntervalStr.unique()
UniqueTimeIntervalStr = sorted(UniqueTimeIntervalStr)

def update_plot(datetime="2017-06-01 12:44:00"):
    timestamp = pd.Timestamp(datetime)
    newBunched, newNonBunched = busSources(timestamp)
    sourceBunched.data = newBunched.data
    sourceNonBunched.data = newNonBunched.data
    push_notebook()

from ipywidgets import interact

interact_panel = interact(update_plot, datetime=UniqueTimeIntervalStr)

Output of df.head(5):
	RecordedAtTime	DirectionRef	PublishedLineName	OriginName	OriginLat	OriginLong	DestinationName	DestinationLat	DestinationLong	VehicleRef	VehicleLocation.Latitude	VehicleLocation.Longitude	NextStopPointName	ArrivalProximityText	DistanceFromStop	ExpectedArrivalTime	ScheduledArrivalTime	BusCoord
0	2017-06-01 00:03:34	0	B8	4 AV/95 ST	40.616104	-74.031143	BROWNSVILLE ROCKAWAY AV	40.656048	-73.907379	NYCT_430	40.635170	-73.960803	FOSTER AV/E 18 ST	approaching	76.0	2017-06-01 00:03:59	24:06:14	(-73.960803, 40.63517)
1	2017-06-01 00:03:43	1	S61	ST GEORGE FERRY/S61 & S91	40.643169	-74.073494	S I MALL YUKON AV	40.575935	-74.167686	NYCT_8263	40.590802	-74.158340	MERRYMOUNT ST/TRAVIS AV	approaching	62.0	2017-06-01 00:03:56	23:58:02	(-74.15834, 40.590802000000004)
2	2017-06-01 00:03:49	0	Bx10	E 206 ST/BAINBRIDGE AV	40.875008	-73.880142	RIVERDALE 263 ST	40.912376	-73.902534	NYCT_4223	40.886010	-73.912647	HENRY HUDSON PKY E/W 235 ST	at stop	5.0	2017-06-01 00:03:56	24:00:53	(-73.91264699999999, 40.88601)
3	2017-06-01 00:03:31	0	Q5	TEARDROP/LAYOVER	40.701748	-73.802399	ROSEDALE LIRR STA via MERRICK	40.666012	-73.735939	NYCT_8422	40.668002	-73.729348	HOOK CREEK BL/SUNRISE HY	< 1 stop away	267.0	2017-06-01 00:04:03	24:03:00	(-73.729348, 40.668002)
4	2017-06-01 00:03:22	1	Bx1	RIVERDALE AV/W 231 ST	40.881187	-73.909340	MOTT HAVEN 136 ST via CONCOURSE	40.809654	-73.928360	NYCT_4710	40.868134	-73.893032	GRAND CONCOURSE/E 196 ST	at stop	11.0	2017-06-01 00:03:56	23:59:38	(-73.89303199999999, 40.868134000000005)

Output: an example of a duplicate record (the rows differ only in ScheduledArrivalTime):
	RecordedAtTime	DirectionRef	PublishedLineName	OriginName	OriginLat	OriginLong	DestinationName	DestinationLat	DestinationLong	VehicleRef	VehicleLocation.Latitude	VehicleLocation.Longitude	NextStopPointName	ArrivalProximityText	DistanceFromStop	ExpectedArrivalTime	ScheduledArrivalTime	BusCoord
181	2017-06-01 00:03:28	1	S46	ST GEORGE FERRY/S46 & S96	40.643429	-74.073654	W SHORE PLZ via CASTLETON	40.601971	-74.19133	NYCT_8188	40.637748	-74.076454	VICTORY BL/MONTGOMERY AV	approaching	112.0	2017-06-01 00:04:03	00:02:22	(-74.076454, 40.637747999999995)
182	2017-06-01 00:03:28	1	S46	ST GEORGE FERRY/S46 & S96	40.643429	-74.073654	W SHORE PLZ via CASTLETON	40.601971	-74.19133	NYCT_8188	40.637748	-74.076454	VICTORY BL/MONTGOMERY AV	approaching	112.0	2017-06-01 00:04:03	00:02:26	(-74.076454, 40.637747999999995)

Output: an example of a bus bunching scenario:
	RecordedAtTime	DirectionRef	PublishedLineName	OriginName	OriginLat	OriginLong	DestinationName	DestinationLat	DestinationLong	VehicleRef	VehicleLocation.Latitude	VehicleLocation.Longitude	NextStopPointName	ArrivalProximityText	DistanceFromStop	ExpectedArrivalTime	ScheduledArrivalTime	BusCoord
16857	2017-06-01 06:13:45	0	Q58	PALMETTO ST/MYRTLE AV	40.700176	-73.910255	LTD FLUSHING MAIN ST	40.757343	-73.829361	NYCT_4520	40.757344	-73.829513	41 RD/MAIN ST	at stop	10.0	2017-06-01 06:13:50	06:02:00	(-73.829513, 40.757344)
17113	2017-06-01 06:13:45	0	Q58	PALMETTO ST/MYRTLE AV	40.700176	-73.910255	FLUSHING MAIN ST	40.757343	-73.829361	NYCT_6563	40.757285	-73.829638	41 RD/MAIN ST	at stop	23.0	2017-06-01 06:13:50	06:08:00	(-73.829638, 40.757284999999996)