Temporal localization of driving actions plays a crucial role in advanced driver-assistance systems and naturalistic driving studies. However, the task is challenging because it demands robustness, reliability, and accurate localization. In this work, we improve overall performance by efficiently utilizing video action recognition networks and adapting them to the problem of action localization. To this end, we first develop a density-guided label smoothing technique based on label probability distributions to facilitate better learning from boundary video segments, which typically include multiple labels. Second, we design a post-processing step that efficiently fuses information from video segments and multiple camera views into scene-level predictions, which helps eliminate false positives. Our methodology achieves competitive performance on the A2 test set of the naturalistic driving action recognition track of the 2022 NVIDIA AI City Challenge, with an F1 score of 0.271.
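The two ideas in the abstract can be illustrated with a minimal sketch. It rests on assumptions the abstract does not confirm: that the "label probability distribution" of a segment is the fraction of its frames carrying each label, and that camera views are fused by plain averaging of per-class probabilities. The function names `density_soft_label` and `fuse_views` are hypothetical and are not taken from the authors' code.

```python
from collections import Counter


def density_soft_label(frame_labels, num_classes):
    """Assumed soft target: the fraction of frames in the segment per class."""
    if not frame_labels:
        raise ValueError("segment has no frames")
    counts = Counter(frame_labels)
    total = len(frame_labels)
    return [counts.get(c, 0) / total for c in range(num_classes)]


def fuse_views(view_probs):
    """Assumed fusion: average per-class probabilities over camera views."""
    if not view_probs:
        raise ValueError("no camera views given")
    num_views = len(view_probs)
    num_classes = len(view_probs[0])
    return [sum(p[c] for p in view_probs) / num_views for c in range(num_classes)]


# A boundary segment of 16 frames: 12 frames of class 0, then 4 of class 3.
boundary_segment = [0] * 12 + [3] * 4
soft_target = density_soft_label(boundary_segment, num_classes=4)

# Two hypothetical camera views predicting over two classes for one segment.
fused = fuse_views([[0.5, 0.5], [1.0, 0.0]])
```

Under these assumptions, a segment lying fully inside one action gets an ordinary one-hot target, while a boundary segment gets a target that reflects how much of each action it contains. The actual smoothing and fusion rules used in the work may differ.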