Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving

End-to-end autonomous driving has made impressive progress in recent years. Existing methods usually adopt a decoupled encoder-decoder paradigm, in which the encoder extracts hidden features from raw sensor data and the decoder outputs the ego vehicle's future trajectories or actions. Under this paradigm, the encoder has no access to the intended behavior of the ego agent, so the burden of locating safety-critical regions within the massive receptive field and inferring future situations falls on the decoder. Even worse, the decoder is usually composed of a few simple multi-layer perceptrons (MLPs) or GRUs, while the encoder is delicately designed (e.g., a combination of heavy ResNets or Transformers). Such an imbalanced division of resources and tasks hampers the learning process. In this work, we aim to alleviate this problem by following two principles: (1) fully utilizing the capacity of the encoder; (2) increasing the capacity of the decoder. Concretely, we first predict a coarse-grained future position and action based on the encoder features. Then, conditioned on the position and action, we imagine the future scene to check the consequences of driving accordingly. We also retrieve the encoder features around the predicted coordinate to obtain fine-grained information about the safety-critical region. Finally, based on the predicted future and the retrieved salient features, we refine the coarse-grained position and action by predicting their offsets from the ground truth. This refinement module can be stacked in a cascaded fashion, which extends the capacity of the decoder with spatial-temporal prior knowledge about the conditioned future. We conduct experiments on the CARLA simulator and achieve state-of-the-art performance on closed-loop benchmarks. Extensive ablation studies demonstrate the effectiveness of each proposed module.
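The coarse-to-fine decoding loop described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Every function name, tensor shape, and weight matrix here is a hypothetical stand-in: random linear maps replace the learned modules, average pooling over a small window replaces feature retrieval, and only the position (not the action) is refined. At training time the predicted offset would be supervised against the ground truth, which this inference-only sketch omits.

```python
import numpy as np


def retrieve_local_feature(feature_map, xy, radius=1):
    """Average-pool encoder features in a window around the predicted (x, y)."""
    height, width, channels = feature_map.shape
    col = int(np.clip(np.rint(xy[0]), 0, width - 1))
    row = int(np.clip(np.rint(xy[1]), 0, height - 1))
    window = feature_map[max(row - radius, 0):row + radius + 1,
                         max(col - radius, 0):col + radius + 1]
    return window.reshape(-1, channels).mean(axis=0)


def imagine_future(feature_map, xy, w_future):
    """Hypothetical stand-in for future prediction conditioned on the position."""
    global_feature = feature_map.mean(axis=(0, 1))
    return np.tanh(np.concatenate([global_feature, xy]) @ w_future)


def refine_once(feature_map, xy, layer):
    """One refinement stage: predict an offset and add it to the estimate."""
    future = imagine_future(feature_map, xy, layer["w_future"])
    local = retrieve_local_feature(feature_map, xy)
    offset = np.concatenate([future, local]) @ layer["w_offset"]
    return xy + offset


def cascaded_decode(feature_map, coarse_xy, layers):
    """Apply the stacked refinement stages; return every intermediate estimate."""
    xy = np.asarray(coarse_xy, dtype=float)
    estimates = [xy.copy()]
    for layer in layers:
        xy = refine_once(feature_map, xy, layer)
        estimates.append(xy.copy())
    return np.stack(estimates)


def make_layers(num_layers, channels, hidden, seed=0):
    """Random (untrained) weights, used only to make the sketch executable."""
    rng = np.random.default_rng(seed)
    return [
        {
            "w_future": rng.normal(scale=0.1, size=(channels + 2, hidden)),
            "w_offset": rng.normal(scale=0.1, size=(hidden + channels, 2)),
        }
        for _ in range(num_layers)
    ]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    bev = rng.normal(size=(16, 16, 8))  # assumed H x W x C encoder feature map
    layers = make_layers(num_layers=3, channels=8, hidden=4)
    print(cascaded_decode(bev, coarse_xy=[8.0, 8.0], layers=layers))
```

Each row of the returned array is the position estimate after one more refinement stage, starting from the coarse prediction. Stacking more entries in `layers` corresponds to cascading more refinement modules.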