Recognizing distracting activities in real-world driving scenarios is critical for ensuring safe and reliable roadways for both drivers and pedestrians. Conventional computer vision techniques are typically data-intensive and require large volumes of annotated training data to detect and classify distracted driving behaviors, which limits their efficiency and scalability. We aim to develop a generalized framework that performs robustly with limited or no annotated training data. Recently, vision-language models have offered large-scale visual-textual pretraining that can be adapted to task-specific learning such as distracted driving activity recognition. Vision-language pretraining models such as CLIP have shown significant promise in learning natural language-guided visual representations. This paper proposes a CLIP-based driver activity recognition approach that identifies driver distraction from naturalistic driving images and videos. CLIP's vision embedding supports zero-shot transfer and task-based finetuning, both of which can be used to classify distracted activities from driving video data. We propose frame-based and video-based frameworks built on top of CLIP's visual representation for the distracted driving detection and classification task. Our results show that the framework achieves state-of-the-art performance, with both zero-shot transfer and video-based CLIP, in predicting the driver's state on two public datasets.
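To make the zero-shot transfer idea concrete, the following is a minimal sketch of how a CLIP-style classifier can assign a driver activity label without task-specific training: an image embedding is compared against the text embeddings of candidate prompts by cosine similarity, and a softmax over the scaled similarities gives class probabilities. This is an illustration under stated assumptions, not the implementation used in this work. The function names, the class labels, the temperature of 100, and the mean pooling of frame embeddings for video are all assumptions, and the vectors below are small placeholders standing in for the outputs of CLIP's image and text encoders.

```python
import numpy as np


def zero_shot_classify(image_emb, text_embs, labels, temperature=100.0):
    """Pick the label whose text embedding is closest to the image embedding.

    image_emb: (d,) vector, assumed to come from a CLIP image encoder.
    text_embs: (k, d) matrix, one row per prompt, assumed to come from a
               CLIP text encoder (e.g. "a photo of a driver texting").
    temperature: similarity scale; 100.0 is an assumed value.
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

    logits = temperature * (text_embs @ image_emb)  # shape (k,)
    logits = logits - logits.max()                  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax over classes
    return labels[int(np.argmax(probs))], probs


def classify_video(frame_embs, text_embs, labels):
    """Video-level prediction by mean-pooling frame embeddings.

    Mean pooling is an assumption made for this sketch; it is one simple
    way to aggregate frames, not necessarily the method of this paper.
    """
    return zero_shot_classify(frame_embs.mean(axis=0), text_embs, labels)


# Hypothetical labels and placeholder embeddings for illustration only.
labels = ["safe driving", "texting", "drinking"]
text_embs = np.eye(3)
image_emb = np.array([0.1, 0.9, 0.2])
frame_embs = np.array([[0.2, 0.1, 0.9], [0.1, 0.3, 0.8]])

frame_label, frame_probs = zero_shot_classify(image_emb, text_embs, labels)
video_label, video_probs = classify_video(frame_embs, text_embs, labels)
print(frame_label, video_label)
```

In practice the placeholder vectors would be replaced by embeddings produced by a pretrained CLIP model, and the prompt wording for each activity class would need to be chosen for the dataset at hand.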