Contributing to Public Transit Data Analysis and Tooling
Author
Michael Chow
Published
December 19, 2024
Hello! Michael Chow here. In 2025, I’ll be focusing on contributing to open source data standards and tooling for public transit. If you work on analytics at a public transit agency, please reach out on linkedin or bluesky—I would love to hear about your work, and how I could be useful!
(I will also be at the TRB Annual Meeting 2025 in Washington D.C.)
At this point, you might be wondering three things:
why focus on open source in public transit?
why is this on the Great Tables blog?
what are you hoping to contribute?
Why focus on open source in public transit?
People doing analytics in public transit are active in developing open data standards (like GTFS, GTFS-RT, and TIDES). These open data sources are complex—they cover schedules that change from week to week, buses moving in realtime, and passenger events. As people like me work more and more on open source tools, we start to lose touch with data analysis in realistic, complex settings. Working on open source transit data is an opportunity for me to ensure my open source tooling work helps people solve real, complex problems.
An inspiration for this angle is the book R for Data Science, which uses realistic datasets—like NYC flights data—to teach data analysis using an ecosystem of packages called the Tidyverse. The Tidyverse packages have dozens of example datasets, and I think this focus on working through examples is part of what made their design so great.
A few years ago, I worked with the Cal-ITP project to build out a warehouse for their GTFS schedule and realtime data. This left a profound impression on me: transit data is perfect for educating on data analyses in R and Python, as well as analytics engineering with tools like dbt or sqlmesh. Many analysts in public transit are querying warehouses, which opens up interesting use-cases with tools like dbplyr (in R) and ibis (in Python).
Our 2022 Annual Posit Table Contest received the most incredible entry—an interactive table of Amtrak routes. The table uses the library reactable in R (which we recently ported to Python!).
Our 2024 contest saw a great set of transit tables submitted by Tiffany Ku at CalTrans (submission, github repo). The submission showed service patterns using stop times binned by hour. It produced these for all California operators!
Here is the reasoning given behind the table:
We can see what service patterns look like across operators by hour of the day. In one table, we can quickly see a summary snapshot of what this service pattern is and see the type of riders the operator typically serves. Here’s the same chart made in a paper about for transit in Calgary, Canada.
What are you hoping to contribute?
I’m hoping to focus on two things:
Workshops to support R and Python analyst teams (and help me appreciate their needs).
Collaboration on creating open source tools and guides for analyzing transit data.
Workshops
Beyond my obvious potential involvement as a table fanatic, I’m really interested in making the daily lives of R and Python users at public transit agencies easier.
Workshops on R. Think Tidyverse, Shiny, Quarto, querying warehouses.
Workshops on Python. Let’s say Polars, Quarto, publishing notebooks, Great Tables, dashboards.
Analytics engineering. How to get analysts to use your data models 😓.
I’m open to whatever topics seem most useful, even if they aren’t in the list above.
Scheduling a workshop
If you’re a public transit agency, reach out on linkedin or bluesky, and I will send my calendly for scheduling.
Collaboration
I’m interested in understanding major challenges analytics teams working on public transit face, and the kind of strategic and tooling support they’d most benefit from. If you’re working on analytics in public transit, I would love to hear about what you’re working on, and the tools you use most.
One topic I’ve discussed with a few agencies is ghost buses, which is when a bus is scheduled but never shows up. This is an interesting analysis because it combines GTFS schedule data with GTFS-RT realtime bus data.
Another is passenger events (e.g. people tapping on or off a bus). This data is challenging because different vendors data record and deliver this data in different ways. This can make it hard for analysts across agencies to discuss analyses—every analysis is different in its own way.
In summary
Analytics in public transit is a really neat, impactful area—with an active community working on open source data standards and tooling. As a data science tool builder on the open source team at Posit, PBC, my mission is to create value and give it away to teams using Python or R for code first data science. I’d love to support open source work in public transit however is most useful.