Scalable Transit Delay Prediction at City Scale: A Systematic Approach with Multi-Resolution Feature Engineering and Deep Learning

Urban bus transit agencies need reliable, network-wide delay predictions to provide accurate arrival information to passengers and support real-time operational control. Accurate predictions help passengers plan their trips, reduce waiting time, and allow operations staff to adjust headways, dispatch extra vehicles, and manage disruptions. Although real-time feeds such as GTFS-Realtime (GTFS-RT) are now widely available, most existing delay prediction systems handle only a few routes, depend on hand-crafted features, and offer little guidance on how to design a scalable, reusable architecture. We present a city-scale prediction pipeline that combines multi-resolution feature engineering, dimensionality reduction, and deep learning. The framework generates 1,683 spatiotemporal features by exploring 23 aggregation combinations over H3 cells, routes, segments, and temporal patterns, and compresses them into 83 components using Adaptive PCA while preserving 95% of the variance. To avoid the "giant cluster" problem that occurs when dense urban areas fall into a single H3 region, we introduce a hybrid H3+topology clustering method that yields 12 balanced route clusters (coefficient of variation 0.608) and enables efficient distributed training. We compare five model architectures on six months of bus operations data from the Société de transport de Montréal (STM) network in Montréal. A global LSTM with cluster-aware features achieves the best trade-off between accuracy and efficiency, outperforming transformer models by 18 to 52% while using 275 times fewer parameters. We also report multi-level evaluation at the elementary-segment, segment, and trip levels with walk-forward validation and latency analysis, showing that the proposed pipeline is suitable for real-time, city-scale deployment and can be reused for other networks with limited adaptation.
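Two quantities quoted above, the 95% variance threshold and the coefficient of variation of the cluster sizes, can be illustrated with a minimal Python sketch. It is not the paper's Adaptive PCA or its clustering method: it shows only the generic rule of keeping the smallest number of leading components whose cumulative explained variance reaches a threshold, and the usual definition of the coefficient of variation as standard deviation divided by mean. The function names, the use of the population standard deviation, and the toy inputs are assumptions made for illustration.

```python
from itertools import accumulate
from statistics import mean, pstdev


def components_for_variance(eigenvalues, threshold=0.95):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches `threshold`.

    `eigenvalues` are the per-component variances of a PCA fit
    (hypothetical input; any order is accepted).
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for k, cumulative in enumerate(accumulate(ordered), start=1):
        if cumulative / total >= threshold:
            return k
    return len(ordered)


def coefficient_of_variation(cluster_sizes):
    """Population standard deviation divided by the mean.

    Assumption: the population form (pstdev) is used here; the sample
    form would give a slightly larger value.
    """
    return pstdev(cluster_sizes) / mean(cluster_sizes)


# Toy values, not data from the study.
toy_variances = [70, 20, 8, 2]
toy_cluster_sizes = [2, 4, 4, 4, 5, 5, 7, 9]

print(components_for_variance(toy_variances))       # 3
print(coefficient_of_variation(toy_cluster_sizes))  # 0.4
```

A lower coefficient of variation means more evenly sized clusters, which is what makes per-cluster training workloads comparable in a distributed setting.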
