Large Language Models (LLMs) integrated into autonomous driving systems demonstrate strong common-sense and reasoning abilities, effectively addressing the pitfalls of purely data-driven methods. However, current LLM-based agents suffer from lengthy inference times and struggle to interact with real-time autonomous driving environments. A key open question is whether we can effectively leverage the knowledge of LLMs to train an efficient and robust Reinforcement Learning (RL) agent. This paper introduces RAPID, a novel \underline{\textbf{R}}obust \underline{\textbf{A}}daptive \underline{\textbf{P}}olicy \underline{\textbf{I}}nfusion and \underline{\textbf{D}}istillation framework, which trains specialized mix-of-policy RL agents using data synthesized by an LLM-based driving agent together with online adaptation. RAPID features three key designs: 1) utilization of offline data collected from an LLM agent to distill expert knowledge into RL policies for faster real-time inference; 2) introduction of robust distillation in RL to inherit both performance and robustness from the LLM-based teacher; and 3) employment of a mix-of-policy approach for joint decision decoding with a policy adapter. Through fine-tuning via online environment interaction, RAPID reduces the forgetting of LLM knowledge while maintaining adaptability to different tasks. Extensive experiments demonstrate RAPID's capability to effectively integrate LLM knowledge into scaled-down RL policies in an efficient, adaptable, and robust way. Code and checkpoints will be made publicly available upon acceptance.
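To make design (1) concrete, the sketch below shows a standard temperature-scaled distillation objective that a small student policy could minimize against action distributions logged from an LLM-based teacher. This is a generic illustration under assumed discrete driving actions, not the paper's actual loss: the function names, shapes, and the choice of KL(teacher || student) are all illustrative.

```python
import numpy as np

def softmax(logits, t=1.0):
    # Numerically stable softmax over the last axis at temperature t.
    z = logits / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened action distributions,
    # averaged over the batch and scaled by T^2 (standard distillation scaling).
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return kl.mean() * temperature ** 2

# Toy usage: a batch of 4 states, 5 discrete driving actions.
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 5))  # logged from the LLM teacher
student_logits = rng.normal(size=(4, 5))  # current student policy outputs
loss = distillation_loss(student_logits, teacher_logits)
```

In practice this offline loss would be combined with an online RL objective during fine-tuning, which is where the adapter-based mix-of-policy decoding in design (3) comes in.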