A Novel Driver Distraction Behavior Detection Based on Self-Supervised Learning Framework with Masked Image Modeling

Driver distraction causes a significant number of traffic accidents every year, resulting in economic losses and casualties. Commercial vehicles are still far from fully driverless operation, and drivers continue to play an important role in operating and controlling the vehicle. Detecting driver distraction behavior is therefore crucial for road safety. At present, driver distraction detection relies primarily on traditional convolutional neural networks (CNNs) and supervised learning methods. These approaches still face challenges such as the high cost of labeled datasets, a limited ability to capture high-level semantic information, and weak generalization performance. To address these problems, this paper proposes a new self-supervised learning method based on masked image modeling (MIM) for driver distraction behavior detection. First, a self-supervised MIM framework is introduced to reduce the heavy human and material cost of dataset labeling. Second, the Swin Transformer is employed as the encoder. Performance is enhanced by reconfiguring the Swin Transformer block and adjusting the distribution of the number of window multi-head self-attention (W-MSA) and shifted window multi-head self-attention (SW-MSA) detection heads across all stages, which also makes the model more lightweight. Finally, various data augmentation strategies are used together with the best-performing random masking strategy to strengthen the model's recognition and generalization ability. Test results on a large-scale driver distraction behavior dataset show that the proposed self-supervised learning method achieves an accuracy of 99.60%, close to the performance of advanced supervised learning methods.
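
As an illustration of the random masking step that MIM pre-training relies on, the following minimal sketch selects a random subset of image patches and blanks them out. It is not the paper's implementation: the function names `random_patch_mask` and `apply_mask`, the patch size, the mask ratio, and the constant fill value are all assumptions chosen for illustration. In practice, MIM frameworks typically substitute a learnable mask token after patch embedding rather than a constant pixel value.

```python
import random


def random_patch_mask(grid_h, grid_w, mask_ratio, seed=None):
    """Return a grid_h x grid_w grid of booleans; True marks a masked patch.

    mask_ratio is the fraction of patches to hide (a hypothetical
    hyperparameter here; the abstract does not state the value used).
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    num_patches = grid_h * grid_w
    num_masked = round(num_patches * mask_ratio)
    rng = random.Random(seed)
    # Sample patch indices uniformly at random, without replacement.
    masked = set(rng.sample(range(num_patches), num_masked))
    return [
        [(row * grid_w + col) in masked for col in range(grid_w)]
        for row in range(grid_h)
    ]


def apply_mask(image, mask, patch_size, fill=0):
    """Replace every pixel that falls inside a masked patch with `fill`.

    `image` is a list of rows of pixel values. The constant `fill` is a
    stand-in for the mask token a real encoder would use.
    """
    return [
        [
            fill if mask[y // patch_size][x // patch_size] else pixel
            for x, pixel in enumerate(row)
        ]
        for y, row in enumerate(image)
    ]


if __name__ == "__main__":
    # An 8x8 single-channel image split into a 4x4 grid of 2x2 patches.
    image = [[1] * 8 for _ in range(8)]
    mask = random_patch_mask(grid_h=4, grid_w=4, mask_ratio=0.5, seed=0)
    for row in apply_mask(image, mask, patch_size=2):
        print(row)
```

During pre-training, the encoder would see only the masked image and be trained to reconstruct the hidden patches, which is what removes the need for class labels at that stage.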
