LeTFuser: Light-weight End-to-end Transformer-Based Sensor Fusion for Autonomous Driving with Multi-Task Learning

In end-to-end autonomous driving, existing sensor fusion techniques for imitation learning prove inadequate in challenging situations that involve numerous dynamic agents. To address this issue, we introduce LeTFuser, a transformer-based algorithm for fusing multiple RGB-D camera representations. We use multi-task learning to perform perception and control tasks simultaneously. Our model comprises two modules. The first, the perception module, encodes the observation data obtained from the RGB-D cameras and carries out tasks such as semantic segmentation, semantic depth cloud (SDC) mapping, and traffic light state recognition. It employs the Convolutional vision Transformer (CvT) \cite{wu2021cvt} to better extract and fuse features from multiple RGB cameras, exploiting the local feature extraction capability of convolution and the global feature extraction capability of transformer modules. The second, the control module, decodes the encoded features together with supplementary data, comprising a rough simulator for static and dynamic environments as well as various measurements, to predict waypoints from a latent feature space. We use two methods to process these outputs and generate the vehicle control levels (e.g., steering, throttle, and brake). The first uses a PID algorithm to follow the waypoints on the fly, whereas the second directly predicts the control policy from the measurement features and environmental state. We evaluate the model and compare it with recent models on the CARLA simulator across a range of scenarios, from normal to adversarial conditions, that simulate real-world driving. Our code is available at \url{https://github.com/pagand/e2etransfuser/tree/cvpr-w} to facilitate future studies.
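To make the first control method concrete, the following is a minimal sketch of how predicted waypoints could be turned into steering, throttle, and brake with PID controllers. It is not taken from the LeTFuser code base. The names `PID` and `waypoints_to_controls`, the gains, the 0.5 s waypoint spacing, the throttle cap, the braking threshold, and the ego-frame convention (x forward, y to the right, positive steer turns right) are all assumptions chosen for illustration. The integral term is approximated by the mean of a sliding window of recent errors, which is a simplification of a textbook PID.

```python
import math
from collections import deque


class PID:
    """Discrete PID controller; the integral term is the mean of recent errors."""

    def __init__(self, kp, ki, kd, window=20):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.errors = deque(maxlen=window)  # sliding window of past errors

    def step(self, error):
        self.errors.append(error)
        integral = sum(self.errors) / len(self.errors)
        # Finite-difference derivative; zero until two samples are available.
        derivative = self.errors[-1] - self.errors[-2] if len(self.errors) > 1 else 0.0
        return self.kp * error + self.ki * integral + self.kd * derivative


def waypoints_to_controls(waypoints, speed, steer_pid, speed_pid,
                          dt=0.5, max_throttle=0.75, brake_ratio=1.1):
    """Map ego-frame waypoints [(x, y), ...] in metres to (steer, throttle, brake).

    Assumed convention: x points forward, y points right, speed is in m/s,
    and consecutive waypoints are dt seconds apart.
    """
    (x0, y0), (x1, y1) = waypoints[0], waypoints[1]

    # Target speed implied by the spacing of the first two waypoints.
    desired_speed = math.hypot(x1 - x0, y1 - y0) / dt

    # Steer towards the midpoint of the first two waypoints.
    aim_x, aim_y = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    heading_error = math.atan2(aim_y, aim_x) / (math.pi / 2.0)
    steer = max(-1.0, min(1.0, steer_pid.step(heading_error)))

    # Brake when clearly faster than the target; otherwise apply throttle.
    brake = speed > brake_ratio * desired_speed
    if brake:
        throttle = 0.0
    else:
        throttle = max(0.0, min(max_throttle, speed_pid.step(desired_speed - speed)))
    return steer, throttle, float(brake)
```

For example, with a stationary vehicle and waypoints straight ahead, `waypoints_to_controls([(1.0, 0.0), (2.0, 0.0)], 0.0, PID(1.0, 0.0, 0.0), PID(0.5, 0.1, 0.1))` returns zero steering, the capped throttle of 0.75, and no braking. The second control method described above would replace this hand-tuned mapping with a learned policy head.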
