CtRL-Sim: Reactive and Controllable Driving Agents with Offline Reinforcement Learning

Evaluating autonomous vehicle (AV) stacks in simulation typically involves replaying driving logs recorded from real-world traffic. However, agents replayed from offline data do not react to the actions of the AV, and their behaviour cannot easily be controlled to simulate counterfactual scenarios. Existing approaches attempt to address these shortcomings with heuristics or learned generative models of real-world data, but they either lack realism or require costly iterative sampling procedures to control the generated behaviours. In this work, we take an alternative approach and propose CtRL-Sim, a method that leverages return-conditioned offline reinforcement learning within a physics-enhanced Nocturne simulator to efficiently generate reactive and controllable traffic agents. Specifically, we process real-world driving data through the Nocturne simulator to generate a diverse offline reinforcement learning dataset, annotated with various reward terms. With this dataset, we train a return-conditioned multi-agent behaviour model that allows for fine-grained manipulation of agent behaviours by modifying the desired returns for the various reward components. This capability enables the generation of a wide range of driving behaviours beyond the scope of the initial dataset, including adversarial behaviours. We demonstrate that CtRL-Sim can efficiently generate diverse and realistic safety-critical scenarios while providing fine-grained control over agent behaviours. Further, we show that fine-tuning the model on safety-critical scenarios that it generates in simulation enhances this controllability.
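To make the idea of return conditioning over several reward components concrete, the following is a minimal sketch and not the CtRL-Sim implementation. It assumes undiscounted returns-to-go computed separately per reward component, and the component names `goal` and `collision`, the helper names, and the toy trajectory are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def returns_to_go(rewards: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Per-component undiscounted return-to-go: R_t[k] = sum of r_t'[k] for t' >= t."""
    running = {name: 0.0 for name in rewards[0]} if rewards else {}
    out = []
    # Walk the trajectory backwards, accumulating each reward component separately.
    for step in reversed(rewards):
        for name, r in step.items():
            running[name] += r
        out.append(dict(running))
    out.reverse()
    return out


def with_target_returns(observed: Dict[str, float],
                        overrides: Dict[str, float]) -> Dict[str, float]:
    """Replace selected components of the conditioning returns with desired values."""
    return {**observed, **overrides}


# Hypothetical 4-step logged trajectory (illustrative reward components only).
logged_rewards = [
    {"goal": 0.0, "collision": 0.0},
    {"goal": 0.5, "collision": 0.0},
    {"goal": 0.0, "collision": 0.0},
    {"goal": 0.5, "collision": 0.0},
]

rtg = returns_to_go(logged_rewards)

# At test time, ask for a low collision return to request adversarial behaviour
# (assumption: a lower collision return corresponds to more collisions).
targets = with_target_returns(rtg[0], {"collision": -1.0})
```

In a setup like this, the per-step values in `rtg` would serve as conditioning labels when training on logged data, while `targets` stands in for the modified returns supplied when generating behaviour. How the model consumes these values is outside the scope of this sketch.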
