Anticipating traffic accidents from dashcam videos is a critical challenge in intelligent transportation systems. Existing methods typically map visual context directly to a collision probability without explicitly modeling the future evolution of the driving scene. In this paper, we propose FLaRA (Predicting Future Latent Representations for Accident Anticipation), a novel predictive architecture that instead forecasts future latent representations of the scene and anticipates accidents from those forecasts. Building upon the Video Joint-Embedding Predictive Architecture (V-JEPA2), our model conditions a predictor network on observed context frames to predict the forthcoming latent features of the scene. A classifier then operates on these predicted future representations rather than only on past observations. To keep these forecasts grounded in realistic future dynamics, we introduce a joint training objective that simultaneously optimizes an auxiliary feature-level reconstruction loss and a cross-entropy classification loss. Extensive evaluations on the Nexar dataset, alongside cross-domain validation on the DAD, DADA-2000, and DoTA benchmarks, demonstrate that our approach achieves state-of-the-art performance while maintaining realistic early-warning capabilities.
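The joint training objective described above can be illustrated with a minimal sketch. The abstract does not give the exact form of the reconstruction term or how the two losses are weighted, so the mean-squared-error distance, the `recon_weight` parameter, and all function names below are assumptions made purely for illustration, not the authors' implementation.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length feature vectors.

    Assumption: the abstract only says "feature-level reconstruction loss";
    squared error is one plausible choice, not necessarily the one used.
    """
    if len(pred) != len(target):
        raise ValueError("feature vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(logits, label):
    """Cross-entropy of one sample from raw logits (stable log-sum-exp)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def joint_loss(pred_latent, target_latent, logits, label, recon_weight=1.0):
    """Classification loss plus a weighted auxiliary reconstruction loss.

    pred_latent   : predicted future latent features (from the predictor)
    target_latent : latent features of the actual future frames
    logits        : classifier outputs computed from the predicted latents
    label         : ground-truth class index (e.g. 0 = safe, 1 = accident)
    recon_weight  : hypothetical balancing coefficient (an assumption)
    """
    return cross_entropy(logits, label) + recon_weight * mse(pred_latent, target_latent)
```

In this sketch the reconstruction term pulls the predicted latents toward the features of the frames that actually follow, while the cross-entropy term trains the classifier that reads those predicted latents.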