Recognizing distracting activities in real-world driving scenarios is critical to keeping the roadways safe and reliable for both drivers and pedestrians. Conventional computer vision techniques are typically data-intensive and require large volumes of annotated training data to detect and classify distracted driving behaviors, which limits their efficiency and scalability. We aim to develop a generalized framework that performs robustly with limited or no annotated training data. Vision-language models have recently made large-scale visual-textual pretraining available for adaptation to task-specific learning such as distracted driving activity recognition, and vision-language pretraining models such as CLIP have shown significant promise in learning natural language-guided visual representations. This paper proposes a CLIP-based driver activity recognition approach that identifies driver distraction from naturalistic driving images and videos. CLIP's visual embeddings support both zero-shot transfer and task-specific finetuning for classifying distracted activities in driving video data. We propose frame-based and video-based frameworks built on top of CLIP's visual representation for distracted driving detection and classification. Our results show that the framework achieves state-of-the-art performance with zero-shot transfer and video-based CLIP for predicting the driver's state on two public datasets.
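As a rough illustration of the zero-shot transfer described in the abstract, the sketch below scores an image embedding against one text-prompt embedding per activity class using scaled cosine similarity followed by a softmax, which is the general CLIP-style recipe. It is a minimal sketch under stated assumptions, not the paper's implementation: the prompts are hypothetical, the 3-dimensional embeddings are toy placeholders standing in for real CLIP encoder outputs, the `logit_scale` of 100 is an assumed value roughly matching released CLIP models, and the mean pooling of frame embeddings is only one simple way to obtain a video-level prediction and may differ from the video-based framework the paper proposes.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Class probabilities from scaled cosine similarity (logit_scale is assumed)."""
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        logits.append(logit_scale * sum(a * b for a, b in zip(img, txt)))
    return softmax(logits)


def predict(image_emb, text_embs):
    """Index of the most probable class for one embedding."""
    probs = zero_shot_probs(image_emb, text_embs)
    return probs.index(max(probs))


def mean_pool(frame_embs):
    """Average per-frame embeddings into one video-level embedding (an assumption)."""
    n = len(frame_embs)
    return [sum(col) / n for col in zip(*frame_embs)]


# Hypothetical class prompts; a real system would encode these with CLIP's text encoder.
PROMPTS = [
    "a photo of a driver driving safely",
    "a photo of a driver texting on a phone",
    "a photo of a driver drinking",
]

# Toy placeholder embeddings, NOT real CLIP outputs.
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frame_embs = [[0.1, 0.9, 0.1], [0.2, 0.8, 0.0], [0.0, 1.0, 0.2]]

# Frame-based prediction: classify every frame independently.
frame_preds = [predict(f, text_embs) for f in frame_embs]

# Video-based prediction: pool the frames first, then classify once.
video_probs = zero_shot_probs(mean_pool(frame_embs), text_embs)
video_pred = PROMPTS[video_probs.index(max(video_probs))]

print(frame_preds, video_pred)
```

No labeled driving data is used at any point here: the class set is defined entirely by the text prompts, which is what makes the transfer zero-shot. Changing or adding activity classes only requires editing the prompt list and re-encoding it.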