A Novel Driver Distraction Behavior Detection Based on Self-Supervised Learning Framework with Masked Image Modeling

Driver distraction causes a significant number of traffic accidents every year, resulting in economic losses and casualties. Commercial vehicles are still far from fully autonomous, and drivers continue to play an important role in operating and controlling the vehicle. Driver distraction behavior detection is therefore crucial for road safety. At present, driver distraction detection relies primarily on traditional Convolutional Neural Networks (CNNs) and supervised learning methods. These approaches still face challenges such as the high cost of labeled datasets, a limited ability to capture high-level semantic information, and weak generalization performance. To address these problems, this paper proposes a new self-supervised learning method based on masked image modeling for driver distraction behavior detection. First, a self-supervised learning framework based on masked image modeling (MIM) is introduced to reduce the heavy labor and material costs of dataset labeling. Second, the Swin Transformer is employed as the encoder. Performance is improved by reconfiguring the Swin Transformer block and adjusting the distribution of window multi-head self-attention (W-MSA) and shifted window multi-head self-attention (SW-MSA) detection heads across all stages, which also makes the model more lightweight. Finally, various data augmentation strategies are combined with the best-performing random masking strategy to strengthen the model's recognition and generalization ability. Test results on a large-scale driver distraction behavior dataset show that the proposed self-supervised learning method achieves an accuracy of 99.60%, approaching the performance of advanced supervised learning methods.
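
The random masking step at the core of masked image modeling can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `random_patch_mask` and the image size (224), patch size (32), and mask ratio (0.6) are hypothetical values chosen only for illustration. The sketch splits an image into a grid of patches and randomly marks a fixed fraction of them as hidden; in an MIM framework, the encoder would then be trained to reconstruct the hidden patches from the visible ones.

```python
import random


def random_patch_mask(image_size=224, patch_size=32, mask_ratio=0.6, seed=None):
    """Return a 2D grid of booleans; True marks a patch hidden from the encoder.

    All default values are illustrative assumptions, not the paper's settings.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size          # patches per side
    num_patches = grid * grid                # total patches in the image
    num_masked = round(num_patches * mask_ratio)
    rng = random.Random(seed)
    # Sample patch indices without replacement so exactly num_masked are hidden.
    masked = set(rng.sample(range(num_patches), num_masked))
    return [[(r * grid + c) in masked for c in range(grid)] for r in range(grid)]


mask = random_patch_mask(seed=0)
masked_count = sum(cell for row in mask for cell in row)
print(f"{masked_count} of {len(mask) ** 2} patches masked")
```

Sampling without replacement keeps the masked fraction identical for every image, so the difficulty of the reconstruction task stays constant across training samples.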
