The use of Large Language Models (LLMs) in reinforcement learning, particularly as planners, has attracted significant attention in recent literature. However, most existing work focuses on robotic planning models that convert the outputs of perception models into language, adopting a `pure-language' strategy. In this work, we propose a hybrid end-to-end learning framework for autonomous driving that combines basic driving imitation learning with LLMs conditioned on multi-modality prompt tokens. Rather than simply converting the perception results of a separately trained model into pure-language input, our contribution is twofold. 1) Visual and LiDAR sensory inputs are integrated end-to-end into learnable multi-modality tokens, which intrinsically alleviates the description bias introduced by separately pre-trained perception models. 2) Instead of letting the LLM drive directly, we explore a hybrid setting in which the LLM helps the driving model correct mistakes and handle complicated scenarios. Our experiments show that the proposed method attains a driving score of 49.21% and a route completion rate of 91.34% in offline evaluation on CARLA, which is comparable to state-of-the-art driving models.
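To make the token-fusion idea concrete, the following is a minimal sketch, not the paper's implementation: all module names, dimensions, and the cross-attention fusion choice are hypothetical. It illustrates how camera and LiDAR features could be projected into a shared embedding space and compressed into a small set of learnable prompt tokens to be prepended to an LLM's input embeddings.

```python
# Hypothetical sketch of learnable multi-modality prompt tokens (not the paper's code).
import torch
import torch.nn as nn

class MultiModalPromptEncoder(nn.Module):
    def __init__(self, img_dim=512, lidar_dim=256, llm_dim=768, n_tokens=8):
        super().__init__()
        # Learnable query tokens that attend to the fused sensor features.
        self.queries = nn.Parameter(torch.randn(n_tokens, llm_dim))
        self.img_proj = nn.Linear(img_dim, llm_dim)
        self.lidar_proj = nn.Linear(lidar_dim, llm_dim)
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)

    def forward(self, img_feats, lidar_feats):
        # img_feats: (B, N_img, img_dim); lidar_feats: (B, N_lidar, lidar_dim)
        fused = torch.cat([self.img_proj(img_feats), self.lidar_proj(lidar_feats)], dim=1)
        q = self.queries.unsqueeze(0).expand(fused.size(0), -1, -1)
        tokens, _ = self.cross_attn(q, fused, fused)  # (B, n_tokens, llm_dim)
        return tokens  # prepended to the LLM input embeddings as prompt tokens

# Usage with dummy features (shapes are illustrative only).
enc = MultiModalPromptEncoder()
img = torch.randn(2, 196, 512)    # e.g., image patch features
lidar = torch.randn(2, 128, 256)  # e.g., point-cloud pillar features
prompt_tokens = enc(img, lidar)   # (2, 8, 768)
```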