Driver distraction is a principal cause of traffic accidents. According to a study conducted by the National Highway Traffic Safety Administration, activities such as interacting with in-car menus, eating or drinking, or talking on the phone while driving can be significant sources of driver distraction. Motivated by this, this paper introduces a novel method for detecting driver distraction using multi-view driver action images. The proposed method, named PoseViNet, is a vision transformer-based framework that combines pose estimation with action inference. Posture information is added so that the transformer focuses more on key features, which makes the framework more adept at identifying critical actions. The proposed framework is compared with various state-of-the-art models on the SFD3 dataset, which represents 10 driver behaviors, and the comparison shows that PoseViNet outperforms these models. The framework is also evaluated on the challenging SynDD1 dataset, which represents 16 driver behaviors, where PoseViNet achieves 97.55% validation accuracy and 90.92% testing accuracy.