In end-to-end driving, a large number of expert driving demonstrations are used to train an agent that mimics the expert by predicting its control actions. This process is self-supervised on vehicle signals (e.g., steering angle, acceleration) and does not require costly extra supervision such as human labeling. Yet work on improving self-supervised end-to-end driving models has largely given way to modular end-to-end models, which require labeling-intensive data such as semantic segmentation at training time. We argue, however, that the latest self-supervised end-to-end models were developed under sub-optimal conditions, with low-resolution images and no attention mechanisms. Further, those models are confined to a limited field of view and remain far from human visual cognition, which can quickly attend to far-apart scene features, a trait that provides a useful inductive bias. In this context, we present a new end-to-end model, trained by self-supervised imitation learning, that leverages a large field of view and a self-attention mechanism. These settings contribute more to the agent's understanding of the driving scene, which leads to better imitation of human drivers. With only self-supervised training data, our model yields almost expert performance on CARLA's NoCrash metrics and can rival SOTA models that require large amounts of human-labeled data. To facilitate further research, our code will be released.
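To make the two ingredients named above concrete, the following is a minimal sketch, not the released model. It assumes that features from several hypothetical camera views are represented as short token vectors, that self-attention is plain scaled dot-product attention with identity projections, and that imitation is scored with an L1 loss on the expert's control actions. The function names, the three-view setup, and the choice of loss are illustrative assumptions, not details taken from the abstract.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    Assumption: a real model would learn separate projection matrices;
    they are omitted here to keep the sketch short.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Similarity of this token to every token, including itself.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Each output token is a weighted mix of all input tokens, so
        # features from far-apart views can influence each other directly.
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def imitation_l1_loss(predicted, expert):
    """Mean absolute error between predicted and expert control actions."""
    return sum(abs(p - e) for p, e in zip(predicted, expert)) / len(expert)


# Hypothetical tokens for left, central, and right views (values are made up).
view_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = self_attention(view_tokens)

# Hypothetical (steering, acceleration) prediction versus the expert signal.
loss = imitation_l1_loss([0.1, 0.5], [0.0, 0.5])
```

The supervision in this sketch comes only from the recorded expert actions, which is the sense in which the abstract calls the training self-supervised: no segmentation masks or other human labels enter the loss.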