SSIL: Self-Supervised Imitation Learning for End-to-End Driving

In autonomous driving, the end-to-end (E2E) approach, which predicts vehicle control signals directly from sensor data, is rapidly gaining attention. Learning a safe E2E driving system requires an extensive amount of driving data and human intervention. Vehicle control data is built from many hours of human driving, which makes large vehicle control datasets difficult to construct. Publicly available driving datasets often cover only a limited range of driving scenes, and vehicle control data can typically be collected only by vehicle manufacturers. To address these challenges, this paper proposes the first self-supervised learning framework for E2E driving, Self-Supervised Imitation Learning (SSIL). The proposed SSIL framework can learn vision-based E2E driving networks without using driving command data or a pre-trained model. To construct pseudo steering angle data, the proposed SSIL predicts a pseudo target from the vehicle's poses at the current and previous time points, which are estimated with light detection and ranging (LiDAR) sensors. In addition, we propose a new cross-attention-based conditioning approach (CACA) for the vision encoder in E2E driving, in which a high-level instruction serves as the conditioning signal for visual information. Our numerical experiments with three different benchmark datasets demonstrate that the proposed SSIL framework achieves E2E driving accuracy comparable to that of its supervised learning counterpart. Furthermore, the proposed pseudo-label predictor outperformed an existing one that uses a proportional-integral-derivative (PID) controller, and the proposed CACA achieved superior performance over existing conditioning approaches.
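
To make the idea of a pose-based pseudo target concrete, the following is a minimal sketch of how a steering angle could be approximated from two consecutive vehicle poses. It is not the pseudo-label predictor proposed in the paper, whose details are not given in the abstract. It assumes a kinematic bicycle model, planar poses given as (x, y, yaw), and a hypothetical wheelbase of 2.7 m; the function name and parameters are illustrative only.

```python
import math


def pseudo_steering_angle(prev_pose, curr_pose, wheelbase_m=2.7):
    """Approximate a steering angle (radians) from two planar poses.

    Each pose is an (x, y, yaw) tuple in metres and radians.
    Assumption: kinematic bicycle model, where the heading change per
    unit distance travelled equals tan(steering) / wheelbase.
    The default wheelbase is a hypothetical value, not from the paper.
    """
    x0, y0, yaw0 = prev_pose
    x1, y1, yaw1 = curr_pose

    # Straight-line distance between poses, used as an approximation
    # of the arc length travelled over one time step.
    distance = math.hypot(x1 - x0, y1 - y0)
    if distance < 1e-6:
        # Vehicle is (nearly) stationary; curvature is undefined.
        return 0.0

    # Wrap the heading change into [-pi, pi) so that crossing the
    # +/-pi boundary does not produce a spurious full turn.
    dyaw = (yaw1 - yaw0 + math.pi) % (2.0 * math.pi) - math.pi

    curvature = dyaw / distance
    return math.atan(wheelbase_m * curvature)
```

With this sign convention, a counter-clockwise heading change yields a positive angle, and two poses with identical headings yield zero. In practice the poses would come from a LiDAR-based pose estimator, and the resulting angles would likely need smoothing before being used as training targets.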
