Developing autonomous vehicles (AVs) requires not only safety and efficiency but also realistic, human-like behaviors that are socially aware and predictable. Achieving this requires sim agent policies that are human-like, fast, and scalable in multi-agent settings. Recent progress in imitation learning with large diffusion-based or tokenized models has shown that behaviors can be captured directly from human driving data, producing realistic policies. However, these models are computationally expensive, slow at inference, and struggle to adapt in reactive, closed-loop scenarios. In contrast, self-play reinforcement learning (RL) scales efficiently and naturally captures multi-agent interactions, but it often relies on heuristics and reward shaping, and the resulting policies can diverge from human norms. We propose SPACeR, a framework that leverages a pretrained tokenized autoregressive motion model as a centralized reference policy to guide decentralized self-play. The reference model provides likelihood rewards and a KL-divergence regularizer, anchoring policies to the human driving distribution while preserving the scalability of RL. Evaluated on the Waymo Sim Agents Challenge, our method achieves performance competitive with imitation-learned policies while being up to 10x faster at inference and 50x smaller in parameter count than large generative models. In addition, we demonstrate on closed-loop ego-planning evaluation tasks that our sim agents can effectively measure planner quality with fast, scalable traffic simulation, establishing a new paradigm for testing autonomous driving policies.
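The core anchoring idea above can be sketched as a reward-shaping rule: the self-play task reward is augmented with the reference model's log-likelihood of the chosen action and penalized by the KL divergence between the learned policy and the reference policy. The following is a minimal illustrative sketch, assuming discrete (tokenized) action distributions; the function names and coefficient values are hypothetical, not SPACeR's actual implementation.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) between two discrete action distributions (lists of probs)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def anchored_reward(task_reward, ref_logprob, pi_dist, ref_dist,
                    alpha=0.1, beta=0.05):
    """Shape a self-play RL reward with a reference motion model.

    task_reward : environment/self-play reward (e.g. progress, collision penalty)
    ref_logprob : log-likelihood of the taken action token under the reference model
    pi_dist     : learned policy's action distribution at the current state
    ref_dist    : reference policy's action distribution at the same state
    alpha, beta : illustrative weighting coefficients (hypothetical values)
    """
    return (task_reward
            + alpha * ref_logprob                      # likelihood reward
            - beta * kl_divergence(pi_dist, ref_dist)) # KL anchor to human data


# Toy usage: if the learned policy matches the reference exactly, the KL
# penalty vanishes and only the likelihood bonus remains.
p = [0.5, 0.3, 0.2]
r = anchored_reward(1.0, math.log(0.5), p, p)
```

When the learned policy drifts from the reference distribution, the KL term grows and pulls it back toward human-like behavior, while the likelihood bonus directly rewards actions the human-data model finds probable.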