Traffic anomaly detection (TAD) in driving videos is critical for ensuring the safety of autonomous driving and advanced driver assistance systems. Previous single-stage TAD methods primarily rely on frame prediction, making them vulnerable to interference from dynamic backgrounds induced by the rapid movement of the dashboard camera. While two-stage TAD methods appear to be a natural solution that mitigates such interference by pre-extracting background-independent features (such as bounding boxes and optical flow) with perceptual algorithms, they are sensitive to the performance of the first-stage perceptual algorithms and may suffer from error propagation. In this paper, we introduce TTHF, a novel single-stage method that aligns video clips with text prompts, offering a new perspective on traffic anomaly detection. Unlike previous approaches, the supervision signal of our method is derived from language rather than orthogonal one-hot vectors, providing a more comprehensive representation. Furthermore, regarding visual representation, we propose to model the high-frequency component of driving videos in the temporal domain. This modeling captures the dynamic changes of driving scenes, enhances the perception of driving behavior, and significantly improves the detection of traffic anomalies. In addition, to better perceive various types of traffic anomalies, we carefully design an attentive anomaly focusing mechanism that visually and linguistically guides the model to adaptively focus on the visual context of interest, thereby facilitating the detection of traffic anomalies. Our proposed TTHF achieves promising performance, outperforming state-of-the-art competitors by +5.4% AUC on the DoTA dataset and demonstrating strong generalization on the DADA dataset.
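As a minimal sketch of the idea behind temporal high-frequency modeling, the dynamic (high-frequency) content of a clip can be exposed by a discrete temporal gradient, i.e., frame differencing, which suppresses static background and emphasizes fast scene changes. The function name and tensor layout below are illustrative assumptions, not the paper's actual TTHF implementation.

```python
import numpy as np

def temporal_high_frequency(clip: np.ndarray) -> np.ndarray:
    """clip: (T, H, W, C) array of frames with values in [0, 1].

    Returns (T-1, H, W, C) absolute frame-to-frame differences,
    a simple proxy for the temporal high-frequency component:
    static regions give ~0, fast-moving regions give large values.
    """
    return np.abs(np.diff(clip.astype(np.float32), axis=0))

# A static clip yields zero response; a moving pattern yields a strong one.
static = np.zeros((4, 8, 8, 3), dtype=np.float32)
moving = np.zeros((4, 8, 8, 3), dtype=np.float32)
for t in range(4):
    moving[t, t, :, :] = 1.0  # a bright row moving downward frame by frame

print(temporal_high_frequency(static).max())  # 0.0
print(temporal_high_frequency(moving).max())  # 1.0
```

In practice such a temporal-gradient signal would be fed alongside (or fused with) the raw frames so the visual encoder can attend to scene dynamics rather than appearance alone.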