Cell Toolbar:
## Introduction
This notebook is the result of a short investigation of the 2015 Pronto Data Challenge.
It is intended to be a demonstration of how IPython/Jupyter notebooks
can be easily combined with the pandas, spark and d3 libraries based on previously existing demonstrations.
This notebook contains queries to generate two charts.
One explores trip flows between different regions of the Pronto system.
The second recreates a chart for a performance metric that has been measured in other bike share systems in the US.
Hopefully this notebook can be used to gain some quick insights into Pronto's first year, but it it is intended more to be a starting point for more queries.
This notebook is the result of a short investigation of the 2015 Pronto Data Challenge. It is intended to be a demonstration of how IPython/Jupyter notebooks can be easily combined with the pandas, spark and d3 libraries based on previously existing demonstrations.
This notebook contains queries to generate two charts. One explores trip flows between different regions of the Pronto system. The second recreates a chart for a performance metric that has been measured in other bike share systems in the US.
Hopefully this notebook can be used to gain some quick insights into Pronto's first year, but it it is intended more to be a starting point for more queries.
## Library Initialization
To begin with, the system needs to be initialized and the libraries loaded. This notebook relies on a combination of Pandas dataframes and Spark SQL queries to generate data for the D3 javascript and Matplotlib libraries. It is based on examples that are referenced in the code comments.
This section can be skipped if you are more interested in the results than the underlying technology.
To begin with, the system needs to be initialized and the libraries loaded. This notebook relies on a combination of Pandas dataframes and Spark SQL queries to generate data for the D3 javascript and Matplotlib libraries. It is based on examples that are referenced in the code comments.
This section can be skipped if you are more interested in the results than the underlying technology.
# see http://stackoverflow.com/questions/29783520/create-pyspark-profile-for-ipython
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName="ProntoChallenge")
# see http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
import pandas as pd
import numpy as np
Load the data. Pandas currently is currently more convenient than Spark at reading csv files of the format provided by Pronto, so we'll read them as Pandas dataframes.
Load the data. Pandas currently is currently more convenient than Spark at reading csv files of the format provided by Pronto, so we'll read them as Pandas dataframes.
station_df = pd.read_csv('2015_station_data.csv')
trip_df = pd.read_csv('2015_trip_data.csv')
trip_df.dropna()
weather_df = pd.read_csv('2015_weather_data.csv')
# it's bigger dataset, and not used in this example.
#status_df = pd.read_csv('2015_status_data.csv')
## Visualization of Inter- and Intra-region Trips
This notebook does not do automated clustering on the data. But as a quick alternative, it turns out the station ids have a region label in them. The following lines make that an explicit column in the station table.
This notebook does not do automated clustering on the data. But as a quick alternative, it turns out the station ids have a region label in them. The following lines make that an explicit column in the station table.
trip_df.loc[:,'from_region'] = trip_df.loc[:,'from_station_id'].map(lambda r: str(r).split('-')[0])
trip_df.loc[:,'to_region'] = trip_df.loc[:,'to_station_id'].map(lambda r: str(r).split('-')[0])
station_df.loc[:,'region'] = station_df.loc[:,'terminal'].map(lambda r: str(r).split('-')[0])
The Pandas library can be used alone, but in some situations the SQL language is more useful. The Spark library converts easily between Pandas and Spark SQL Dataframes. The following are SQL versions of the tables.
The Pandas library can be used alone, but in some situations the SQL language is more useful. The Spark library converts easily between Pandas and Spark SQL Dataframes. The following are SQL versions of the tables.
stations_sdf = sqlContext.createDataFrame(station_df)
stations_sdf.registerTempTable('stations')
trip_sdf = sqlContext.createDataFrame(trip_df)
trip_sdf.registerTempTable('trips')
The SQL tables can now be used to group and count the trips by region labels. Pandas could be used to express this query as well, though the specific query would be different.
The SQL tables can now be used to group and count the trips by region labels. Pandas could be used to express this query as well, though the specific query would be different.
regions_sql = """
SELECT from_region, to_region, count(*) as trips
FROM trips
GROUP BY from_region, to_region
"""
region_result = sqlContext.sql(regions_sql)
One benefit of Pandas dataframes is they display automatically as a table in Jupyter notebooks.
This is just the first five lines of the region result.
One benefit of Pandas dataframes is they display automatically as a table in Jupyter notebooks. This is just the first five lines of the region result.
region_result.toPandas()[:5]
This table is easier to interpret as a cross-tabulation.
The cross-tabulation will also make it easier to display as a chart later.
This table is easier to interpret as a cross-tabulation. The cross-tabulation will also make it easier to display as a chart later.
region_to_from_df = region_result.toPandas()
region_crosstab_df = pd.crosstab(trip_df.from_region, trip_df.to_region)
region_crosstab_df
For use with the javascript D3 visualization library, one last format conversion to JSON is helpful.
For use with the javascript D3 visualization library, one last format conversion to JSON is helpful.
import json
# see http://stackoverflow.com/questions/3488934/simplejson-and-numpy-array
class NumpyAwareJSONEncoder(json.JSONEncoder):
def default(self, obj):
if isinstance(obj, np.ndarray) and obj.ndim == 1:
return obj.tolist()
region_json = json.dumps(region_crosstab_df.as_matrix().tolist(), cls=NumpyAwareJSONEncoder)
region_labels = list(region_crosstab_df.columns.values)
The following section is based on a demonstration by Ariel Keselman on integrating the D3 javascript library with IPython/Jupyter notebooks, specifically with chord charts.
Note that in the current version of this notebook, the json matrix and region labels have been copied to the javascript template by hand.
The following section is based on a demonstration by Ariel Keselman on integrating the D3 javascript library with IPython/Jupyter notebooks, specifically with chord charts.
Note that in the current version of this notebook, the json matrix and region labels have been copied to the javascript template by hand.
# This section is modified from https://github.com/skariel/IPython_d3_js_demo/blob/master/d3_js_demo.ipynb
from IPython.display import IFrame
import re
def replace_all(txt,d):
rep = dict((re.escape('{'+k+'}'), str(v)) for k, v in d.items())
pattern = re.compile("|".join(rep.keys()))
return pattern.sub(lambda m: rep[re.escape(m.group(0))], txt)
count=0
def serve_html(s,w,h):
import os
global count
count+=1
fn= '__tmp'+str(os.getpid())+'_'+str(count)+'.html'
with open(fn,'w') as f:
f.write(s)
fn = '__tmp22236_1.html'
return IFrame('files/'+fn,w,h)
def chord_chart(w=800,
h=800,
ball_count=150,
rad_min=2,
rad_fac=11,
color_count=14):
d={
'width' :w,
'height' :h,
'ball_count' :ball_count,
'rad_min' :rad_min,
'rad_fac' :rad_fac,
'color_count':color_count
}
with open('chord.template','r') as f:
s=f.read()
s= replace_all(s,d)
return serve_html(s,w+30,h+30)
## Note: unfortunately the chart will not render in a static notebook, such as on github.
The arcs around the perimeter represent the region labels for the stations, and their size represents the number of trips. The paths between regions represent all the intra-region trips. The width of the path where it touches a region arc is the number of trips originating in that region. If you hover over a region trips in other regions will be dimmed.
At first glance this chart has a few noticable features. The large number of intra-region trips stands out, though several smaller regions are exceptions (First Hill, Belltown and Central Business District. Some regions, such as UW and UD are tightly connected only to each other. Capitol Hill does originate more trips to South Lake Union, the Central Business District and Belltown than return, but the effect doesn't dominate the chart. Pioneer Square has very even connections to other regions, which may be a function of the Second Avenue protected bike lane.
The arcs around the perimeter represent the region labels for the stations, and their size represents the number of trips. The paths between regions represent all the intra-region trips. The width of the path where it touches a region arc is the number of trips originating in that region. If you hover over a region trips in other regions will be dimmed.
At first glance this chart has a few noticable features. The large number of intra-region trips stands out, though several smaller regions are exceptions (First Hill, Belltown and Central Business District. Some regions, such as UW and UD are tightly connected only to each other. Capitol Hill does originate more trips to South Lake Union, the Central Business District and Belltown than return, but the effect doesn't dominate the chart. Pioneer Square has very even connections to other regions, which may be a function of the Second Avenue protected bike lane.
chord_chart(ball_count=30, color_count=14, rad_fac=10, rad_min=3, w=800, h=800)
The thing to remember is the size of the regions in the chart above is based on the number of trips. Another query can show how many stations are in each region. This query also assigns random colors (different from the chart above) for use later.
The thing to remember is the size of the regions in the chart above is based on the number of trips. Another query can show how many stations are in each region. This query also assigns random colors (different from the chart above) for use later.
stations_per_region_sql = """
SELECT region, count(*) as n_stations
FROM stations
GROUP BY region
"""
stations_per_region = sqlContext.sql(stations_per_region_sql).toPandas()
import random
random.seed(20151115)
stations_per_region.loc[:,'color'] = stations_per_region.iloc[:,0].map(
lambda n: '#%02X%02X%02X' % (random.randint(0,255),
random.randint(0,255),
random.randint(0,255)))
stations_per_region
This shows that South Lake Union is a group of only ten stations. It's the same for Capitol Hill. UW and the University District combined are also 10 stations. Pioneer square has only two stations, but they seem to be serving a distinct purpose, as are the four BT stations and the two waterfront stations.
My immediate concern is whether we currently have three network fragments (UW/UD, Capitol Hill and SLU/CBD/BT) rather than one dense network. The data does have enough information to check how our network is performing compared to some existing networks in the US. This will be the topic of the next chart.
This shows that South Lake Union is a group of only ten stations. It's the same for Capitol Hill. UW and the University District combined are also 10 stations. Pioneer square has only two stations, but they seem to be serving a distinct purpose, as are the four BT stations and the two waterfront stations.
My immediate concern is whether we currently have three network fragments (UW/UD, Capitol Hill and SLU/CBD/BT) rather than one dense network. The data does have enough information to check how our network is performing compared to some existing networks in the US. This will be the topic of the next chart.
## 15 Minute Destinations
Analysis of other bike share networks shows that usage statistics rise non-linearly with station density. http://nacto.org/wp-content/uploads/2015/09/NACTO_Walkable-Station-Spacing-Is-Key-For-Bike-Share_Sc.pdf
The following query simply counts station pairs
that have had a trip of less than 15 minutes.
They are then grouped and sorted from most destinations to least.
Analysis of other bike share networks shows that usage statistics rise non-linearly with station density. http://nacto.org/wp-content/uploads/2015/09/NACTO_Walkable-Station-Spacing-Is-Key-For-Bike-Share_Sc.pdf
The following query simply counts station pairs that have had a trip of less than 15 minutes. They are then grouped and sorted from most destinations to least.
stations_within_15_min_sql = """
SELECT DISTINCT from_station_id as station_id,
from_station_name as station_name,
count(DISTINCT to_station_id) as destinations
FROM trips
WHERE tripduration <= (15*60)
AND from_station_id != to_station_id
GROUP BY from_station_id, from_station_name
ORDER BY destinations DESC
"""
n_destinations_df = sqlContext.sql(stations_within_15_min_sql).toPandas()
n_destinations_df
This shows that there are 10 stations that are reasonably well connected (40+ destinations). The two busy waterfront stations actually only have a mediocre number of destinations (19 and 23). The University stations have very low connectivity and Children's Hospital is an outlier at the edge of the network.
The graphs in the NACTO paper suggest that stations with 40 destinations could expect to have around 10-30 departures per day. This can also be checked.
This shows that there are 10 stations that are reasonably well connected (40+ destinations). The two busy waterfront stations actually only have a mediocre number of destinations (19 and 23). The University stations have very low connectivity and Children's Hospital is an outlier at the edge of the network.
The graphs in the NACTO paper suggest that stations with 40 destinations could expect to have around 10-30 departures per day. This can also be checked.
departures_per_day_sql = """
SELECT from_station_id as station_id, count(*)/365 as departures
FROM trips
GROUP BY from_station_id
ORDER BY departures DESC
"""
arrivals_per_day_sql = """
SELECT to_station_id as station_id, count(*)/365 as arrivals
FROM trips
GROUP BY to_station_id
ORDER BY arrivals DESC
"""
departures_df = sqlContext.sql(departures_per_day_sql).toPandas()
arrivals_df = sqlContext.sql(arrivals_per_day_sql).toPandas()
n_trips_df = arrivals_df.merge(departures_df).merge(n_destinations_df)
n_trips_df.loc[:,'trips_to_from'] = n_trips_df.loc[:,'arrivals'] + n_trips_df.loc[:,'departures']
# merge in region, stations_per_region and region color
n_trips_df.loc[:,'region'] = n_trips_df.loc[:,'station_id'].map(lambda r: str(r).split('-')[0])
n_trips_df = n_trips_df.merge(stations_per_region.loc[:,['region','color']], on='region')
n_trips_df
%matplotlib inline
import matplotlib.pyplot as plt
_, rides = plt.subplots()
rides.grid(color='grey',
linestyle='dotted')
scatter = rides.scatter(n_trips_df.loc[:,'destinations'],
n_trips_df.loc[:,'trips_to_from'],
c=n_trips_df.loc[:,'color'],
s = n_trips_df.loc[:,'destinations'])
rides.set_xlabel('other docks within 15 minute ride')
rides.set_ylabel('rides per day to/from dock')
rides.set_title("15 Minute Rides", size=14);
This plot looks similar to the plots in the NACTO paper. The stations with the most destinations have between 10 and 30 rides per day. The more remote stations appear to have fewer rides. Because this notebook did not correct for stations that were added during the year, a few stations such as the new station at 9th and Mercer do not have an accurate rides per day number.
Note also that the region colors in this plot do not correspond to the colors in the chord chart above, and there is no color/region key. Still it is possible to see that the different regions have slightly different usage characteristics beyond station density and region size.
This plot looks similar to the plots in the NACTO paper. The stations with the most destinations have between 10 and 30 rides per day. The more remote stations appear to have fewer rides. Because this notebook did not correct for stations that were added during the year, a few stations such as the new station at 9th and Mercer do not have an accurate rides per day number.
Note also that the region colors in this plot do not correspond to the colors in the chord chart above, and there is no color/region key. Still it is possible to see that the different regions have slightly different usage characteristics beyond station density and region size.
## Conclusions and Next Steps
This notebook is intended to provides some quick insight into Pronto performance after a year of operation and toin demonstrate some useful technologies for continuing investigation.
The inter-region trip chart seems to show that there is some fragmentation and hill effects in the network, but they are not out of scale when compared to intra-region trips in the larger regions and the well-connected central business district, Belltown and even pioneer square regions.
The 15 minute rides chart shows that our network has similar performance to other bike share systems, but with the caveat that less dense and peripheral stations do not have enough destinations see high usage.
The Pandas, Spark and D3 libraries make it easy to manipulate and visualize this data. Similar charts can be generated by demographics, time of day or year with slight modifications to the query. It would be interesting to see the charts with automated region clusters rather than the region labels from the station table.
This notebook is intended to provides some quick insight into Pronto performance after a year of operation and toin demonstrate some useful technologies for continuing investigation.
The inter-region trip chart seems to show that there is some fragmentation and hill effects in the network, but they are not out of scale when compared to intra-region trips in the larger regions and the well-connected central business district, Belltown and even pioneer square regions.
The 15 minute rides chart shows that our network has similar performance to other bike share systems, but with the caveat that less dense and peripheral stations do not have enough destinations see high usage.
The Pandas, Spark and D3 libraries make it easy to manipulate and visualize this data. Similar charts can be generated by demographics, time of day or year with slight modifications to the query. It would be interesting to see the charts with automated region clusters rather than the region labels from the station table.