Developing autonomous vehicles (AVs) requires not only safety and efficiency but also realistic, human-like behaviors that are socially aware and predictable. Achieving this requires sim agent policies that are human-like, fast, and scalable in multi-agent settings. Recent progress in imitation learning with large diffusion-based or tokenized models has shown that behaviors can be captured directly from human driving data, producing realistic policies. However, these models are computationally expensive, slow at inference, and struggle to adapt in reactive, closed-loop scenarios. In contrast, self-play reinforcement learning (RL) scales efficiently and naturally captures multi-agent interactions, but it often relies on heuristics and reward shaping, and the resulting policies can diverge from human norms. We propose SPACeR, a framework that leverages a pretrained tokenized autoregressive motion model as a centralized reference policy to guide decentralized self-play. The reference model provides likelihood rewards and a KL-divergence regularizer, anchoring policies to the human driving distribution while preserving the scalability of RL. Evaluated on the Waymo Sim Agents Challenge, our method achieves performance competitive with imitation-learned policies while being up to 10x faster at inference and 50x smaller in parameter count than large generative models. In addition, we demonstrate on closed-loop ego-planning evaluation tasks that our sim agents can effectively measure planner quality with fast, scalable traffic simulation, establishing a new paradigm for testing autonomous driving policies.
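The core anchoring idea above can be sketched as a reward-shaping rule: the self-play task reward is augmented with the reference model's log-likelihood of the chosen action and penalized by the KL divergence between the learned policy and the reference policy. The following is a minimal illustrative sketch, assuming discrete (tokenized) action distributions; the function names and coefficient values are hypothetical, not SPACeR's actual implementation.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) between two discrete action distributions (lists of probs)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def anchored_reward(task_reward, ref_logprob, pi_dist, ref_dist,
                    alpha=0.1, beta=0.05):
    """Shape a self-play RL reward with a reference motion model.

    task_reward : environment/self-play reward (e.g. progress, collision penalty)
    ref_logprob : log-likelihood of the taken action token under the reference model
    pi_dist     : learned policy's action distribution at the current state
    ref_dist    : reference policy's action distribution at the same state
    alpha, beta : illustrative weighting coefficients (hypothetical values)
    """
    return (task_reward
            + alpha * ref_logprob                      # likelihood reward
            - beta * kl_divergence(pi_dist, ref_dist)) # KL anchor to human data


# Toy usage: if the learned policy matches the reference exactly, the KL
# penalty vanishes and only the likelihood bonus remains.
p = [0.5, 0.3, 0.2]
r = anchored_reward(1.0, math.log(0.5), p, p)
```

When the learned policy drifts from the reference distribution, the KL term grows and pulls it back toward human-like behavior, while the likelihood bonus directly rewards actions the human-data model finds probable.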