The identification of hazardous driving behaviors from in-cabin video streams is essential for enhancing road safety and supporting the detection of traffic violations and unsafe driver actions. However, current temporal action localization techniques often struggle to balance accuracy with computational efficiency. In this work, we develop and evaluate a temporal action localization framework tailored for driver monitoring scenarios, particularly suited to periodic inspection settings such as transportation safety checkpoints or fleet management assessment systems. Our approach follows a two-stage pipeline that combines VideoMAE-based feature extraction with an Augmented Self-Mask Attention (AMA) detector, enhanced by a Spatial Pyramid Pooling-Fast (SPPF) module to capture multi-scale temporal features. Experimental results reveal a distinct trade-off between model capacity and efficiency. At the feature extraction stage, the ViT-Giant backbone delivers richer representations, reaching 88.09% Top-1 test accuracy, while the ViT-based variant proves a practical alternative, achieving 82.55% accuracy at a significantly lower fine-tuning cost (101.85 GFLOPs/segment versus 1584.06 GFLOPs/segment for Giant). In the downstream localization task, integrating SPPF consistently improves performance across all configurations. Notably, the ViT-Giant + SPPF model achieves a peak mAP of 92.67%, while the lightweight ViT-based configuration maintains robust results.
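The SPPF module described above can be adapted from its original 2D image form to the 1D temporal feature sequences used here. The following is a minimal sketch of such a temporal SPPF block; the class name, channel sizes, and kernel width are illustrative assumptions, not the authors' implementation. It follows the standard SPPF idea of chaining three identical max-pools and concatenating their outputs, so that one small pooling kernel yields several effective receptive fields at low cost.

```python
import torch
import torch.nn as nn


class TemporalSPPF(nn.Module):
    """Hypothetical 1D adaptation of Spatial Pyramid Pooling-Fast (SPPF).

    Three chained max-pools over the time axis emulate pooling with
    kernel sizes k, ~2k-1, and ~3k-2; concatenating the input with all
    three pooled maps mixes multi-scale temporal context.
    """

    def __init__(self, c_in: int, c_out: int, k: int = 5):
        super().__init__()
        c_mid = c_in // 2  # bottleneck width, an illustrative choice
        self.cv1 = nn.Conv1d(c_in, c_mid, kernel_size=1)
        # stride=1 with symmetric padding preserves sequence length
        self.pool = nn.MaxPool1d(kernel_size=k, stride=1, padding=k // 2)
        self.cv2 = nn.Conv1d(c_mid * 4, c_out, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), e.g. per-segment VideoMAE features
        x = self.cv1(x)
        p1 = self.pool(x)    # receptive field k
        p2 = self.pool(p1)   # effective receptive field ~2k-1
        p3 = self.pool(p2)   # effective receptive field ~3k-2
        return self.cv2(torch.cat([x, p1, p2, p3], dim=1))
```

Because all pooling is stride-1 with matching padding, the temporal resolution of the feature sequence is unchanged, which lets the block drop into a localization head without altering frame-level alignment.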