To address the challenges of sensor fusion and safety risk prediction, contemporary closed-loop autonomous driving networks based on imitation learning typically demand large parameter counts and substantial computational resources. Given the constrained computational capacity of onboard vehicular computers, we introduce a compact yet capable solution named EfficientFuser. This approach employs EfficientViT to extract visual information and integrates feature maps via cross attention, then uses a decoder-only transformer to combine the resulting features. For prediction, learnable vectors are embedded as tokens that probe the association between the task and the sensor features through attention. Evaluated on the CARLA simulation platform, EfficientFuser demonstrates remarkable efficiency: it uses only 37.6% of the parameters and 8.7% of the computation of the state-of-the-art lightweight method while achieving a driving score just 0.4% lower, and its safety score approaches that of the leading safety-enhanced method, showcasing its efficacy and potential for practical deployment in autonomous driving systems.
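The core prediction mechanism described above, learnable task tokens querying fused sensor features through attention, can be illustrated with a minimal single-head cross-attention sketch. This is a hedged illustration only: the token counts, feature dimension, and function names (`cross_attention`, `sensor_tokens`, `task_queries`) are hypothetical and do not come from the EfficientFuser implementation.

```python
import numpy as np

def cross_attention(queries, keys_values, wq, wk, wv):
    """Single-head scaled dot-product cross attention.

    Learnable task queries attend over sensor feature tokens and
    return one fused feature vector per query token.
    """
    q = queries @ wq                      # project task tokens to query space
    k = keys_values @ wk                  # project sensor tokens to key space
    v = keys_values @ wv                  # project sensor tokens to value space
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # numerically stable softmax over the sensor-token axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                    # convex combination of value vectors

rng = np.random.default_rng(0)
d = 64                                          # hypothetical feature dimension
sensor_tokens = rng.normal(size=(50, d))        # stand-in for fused camera features
task_queries = rng.normal(size=(4, d))          # stand-in for learnable task tokens
wq, wk, wv = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))

out = cross_attention(task_queries, sensor_tokens, wq, wk, wv)
print(out.shape)  # one d-dimensional fused feature per task token
```

In the paper's setting the task tokens would be trained jointly with the network so that, at inference time, their attention weights select the sensor features most relevant to each prediction target.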