MUNI Transit ML Pipeline Overview
Motivation
San Francisco's MTA collects and publishes a rich real-time data stream through a GTFS API. I ride SFMTA a lot, so I started playing with my Raspberry Pis for data collection to see what is available. I am now moving on to analyzing trends in the data with state-of-the-art models.
The best part is that I can employ my RPi Spark Cluster to handle the data processing.
Goals
- Reduce cloud costs by pushing as much computation as possible to edge devices.
Structure
The structure is as follows, conceived around two constraints:
- To limit cost overhead, I don't want a constant EC2 instance running;
- To cut out unnecessary IO to S3 buckets (hence the use of MinIO).
Extract:
- RPi_1 (Caddy) fetches from the SFMTA API (a fetch-loop sketch follows this list):
  - every 60s, it updates /mnt/ssd/hot_muni_data/muni_data.json, which is served by a Dockerized FastAPI server;
  - every 180s, it ships the JSON to Quentin, who stores it alongside weather data in /mnt/ssd/raw/vehicles.
- RPi_2 (Quentin) fetches weather data every 30 minutes and stores it on a 128GB SSD at /mnt/ssd/raw/weather.
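To make the Extract step concrete, here is a minimal sketch of Caddy's fetch loop. The API URL is a placeholder (the real SFMTA endpoint and any auth key are not shown here), and the script layout is my own illustration; the temp-file-then-rename trick is just one way to keep the FastAPI server from ever reading a half-written file. The same script could run once per invocation under cron instead of looping.

```python
import json
import os
import tempfile
import time

import requests

API_URL = "https://api.example.com/sfmta/vehicles"  # placeholder endpoint
OUT_PATH = "/mnt/ssd/hot_muni_data/muni_data.json"
FETCH_INTERVAL_S = 60


def fetch_once() -> None:
    resp = requests.get(API_URL, timeout=10)
    resp.raise_for_status()
    payload = resp.json()
    # Write to a temp file on the same filesystem, then atomically replace,
    # so the FastAPI server never sees a partially written file.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(OUT_PATH))
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
        os.replace(tmp_path, OUT_PATH)
    except BaseException:
        os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    while True:
        try:
            fetch_once()
        except requests.RequestException as exc:
            print(f"fetch failed: {exc}")  # keep looping on transient errors
        time.sleep(FETCH_INTERVAL_S)
```

The Dockerized FastAPI server then only has to serve the file itself, e.g. by returning a FileResponse for /mnt/ssd/hot_muni_data/muni_data.json.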
Transform:
- At the end of every day, a Spark job is triggered on my RPi Spark Cluster (sketched below) that:
  - joins real-time MUNI data with weather data and MUNI static arrival estimates;
  - tokenizes the resulting data;
  - writes that day's data to /mnt/ssd/processed on Quentin's SSD.
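A rough sketch of what that nightly job could look like in PySpark, under some loud assumptions: the raw files are JSON, both feeds carry a `timestamp` column, and the vehicle data carries a `route_id`. The join against the static arrival estimates is elided since it would look the same.

```python
from datetime import date

from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.appName("muni-daily-transform").getOrCreate()

# Raw feeds written by Caddy and Quentin (paths from the post; the JSON
# format and column names are assumptions).
vehicles = spark.read.json("/mnt/ssd/raw/vehicles")
weather = spark.read.json("/mnt/ssd/raw/weather")

# Bucket both feeds to the hour so the 60s vehicle pings can join against
# the 30-minute weather observations; keep one weather row per hour.
vehicles = vehicles.withColumn(
    "hour", F.date_trunc("hour", F.to_timestamp("timestamp"))
)
weather = (
    weather.withColumn("hour", F.date_trunc("hour", F.to_timestamp("timestamp")))
    .dropDuplicates(["hour"])
)

joined = vehicles.join(weather, on="hour", how="left")

# One way to "tokenize": index categorical columns into integer IDs.
indexer = StringIndexer(
    inputCol="route_id", outputCol="route_idx", handleInvalid="keep"
)
tokenized = indexer.fit(joined).transform(joined)

# Partition the output by day on Quentin's SSD.
tokenized.write.mode("overwrite").parquet(
    f"/mnt/ssd/processed/{date.today().isoformat()}"
)
```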
Load:
- Future work includes loading the tokenized data into a classification pipeline (e.g., scikit-learn or PySpark ML); a hypothetical sketch follows.
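Purely hypothetical at this point, but in PySpark ML the load step could look like the following. The feature columns (`route_idx`, `temp_c`) and the `delayed` label are placeholders for a task that has not been defined yet.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("muni-classify").getOrCreate()
data = spark.read.parquet("/mnt/ssd/processed")

# Assemble assumed numeric columns into a feature vector and fit a
# baseline classifier on an assumed binary label.
assembler = VectorAssembler(inputCols=["route_idx", "temp_c"], outputCol="features")
clf = LogisticRegression(featuresCol="features", labelCol="delayed")
model = Pipeline(stages=[assembler, clf]).fit(data)
```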
Stages
- Enabling Spark-Cluster IO with S3 buckets (done, to be written up);
- Configuring MinIO bucket on USB storage device;
- Configuring cron jobs for fetch and transform/push (potentially jumping straight to Airflow); a possible crontab layout is sketched below.
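The schedules above map onto crontab entries roughly like this. Script names and paths are placeholders, and the 60s fetch becomes an every-minute job, since vanilla cron cannot schedule below one minute.

```
# Caddy (RPi_1): fetch MUNI data every minute; ship JSON to Quentin every 3 minutes.
* * * * *    /home/pi/bin/fetch_muni.py
*/3 * * * *  /home/pi/bin/ship_to_quentin.sh
# Quentin (RPi_2): fetch weather every 30 minutes.
*/30 * * * * /home/pi/bin/fetch_weather.py
# Spark cluster: trigger the nightly transform just before midnight.
55 23 * * *  /home/pi/bin/run_daily_transform.sh
```

In practice these lines would live in the crontabs of three different machines, which is exactly the coordination problem that would make jumping straight to Airflow attractive.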