CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving

End-to-end autonomous driving models trained with imitation learning (IL) often generalize poorly, particularly in long-tail scenarios where expert demonstrations are sparse. Reinforcement learning (RL) can provide complementary task-level supervision, but applying RL to real-world autonomous driving is challenging in offline settings without interactive simulators, where datasets are dominated by expert actions and provide limited behavioral diversity. We propose CoIRL-AD, a competitive dual-policy framework that integrates IL and RL under a unified offline training regime. CoIRL-AD decouples imitation and reward optimization into separate actors to alleviate objective conflicts, uses imagined future rollouts for long-horizon reward estimation, and introduces a competition mechanism that selectively transfers beneficial behaviors while keeping RL anchored to expert-like driving. Experiments on the nuScenes benchmark show that CoIRL-AD consistently improves robustness over strong IL-based baselines, with especially large gains in cross-city generalization and long-tail scenarios. Code is available at: https://github.com/SEU-zxj/CoIRL-AD.

Related content

Imitation learning

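Imitation learning is a family of tasks in which a learner tries to mimic expert behavior in order to reach the best possible performance. Mainstream methods currently include supervised imitation learning, stochastic mixing iterative learning (SMILe), and dataset aggregation (DAgger). The principle behind imitation learning is to implicitly give the learner prior information about the world, for example by having it perform and learn human behavior. In an imitation learning task, the agent wants to learn a policy that carries out a behavior as closely as possible to the way a human expert would, so it looks for the best way to use the training set (input-output pairs) demonstrated by that expert. When an agent learns human behavior, imitation learning is still needed, but simulating the behavior in real time is very costly. In contrast, apprenticeship learning, proposed by Andrew Ng, executes a purely greedy (exploitative) policy and uses reinforcement learning to traverse all (state, action) trajectories in order to learn a near-optimal policy. It requires extremely difficult maneuvers, and recovering from unobserved states is almost impossible. Imitation learning can handle these unexplored states, so it offers a more reliable general framework for many tasks, such as autonomous driving.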
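As a rough illustration of the supervised form of imitation learning described above, the sketch below fits a policy directly to expert input-output pairs (often called behavior cloning). Everything in it is an assumption made for illustration: the two-dimensional state (lateral offset, heading error), the linear steering "expert", and the function names `fit_behavior_cloning` and `policy` are hypothetical and are not taken from CoIRL-AD or any specific library.

```python
import numpy as np


def add_bias(states):
    """Append a constant 1 column so the linear policy can learn an offset."""
    return np.hstack([states, np.ones((states.shape[0], 1))])


def policy(weights, states):
    """Linear policy: action = [state, 1] @ weights."""
    return add_bias(states) @ weights


def fit_behavior_cloning(states, actions):
    """Fit the linear policy to expert (state, action) pairs by least squares."""
    weights, *_ = np.linalg.lstsq(add_bias(states), actions, rcond=None)
    return weights


# Hypothetical expert: steer against lateral offset and heading error.
expert_weights = np.array([[-0.5], [-1.2], [0.0]])

# Synthetic demonstrations: 200 states and the expert's noiseless actions.
rng = np.random.default_rng(0)
demo_states = rng.normal(size=(200, 2))
demo_actions = policy(expert_weights, demo_states)

# Supervised imitation: regress expert actions on the observed states.
learned_weights = fit_behavior_cloning(demo_states, demo_actions)

# The cloned policy can now be queried on a state it has not seen.
new_state = np.array([[0.3, -0.1]])
cloned_action = policy(learned_weights, new_state)
```

The fit only uses states that appear in the demonstrations, so nothing in it constrains the policy's behavior in regions the expert never visited. In this toy case the linear form happens to extrapolate, but with a real driving policy that gap is the generalization weakness the abstract attributes to IL-trained models.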
