MRIC: Model-Based Reinforcement-Imitation Learning with Mixture-of-Codebooks for Autonomous Driving Simulation

From arXiv. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Accurately simulating diverse behaviors of heterogeneous agents in various scenarios is fundamental to autonomous driving simulation. This task is challenging due to the multi-modality of behavior distribution, the high-dimensionality of driving scenarios, distribution shift, and incomplete information. Our first insight is to leverage state-matching through differentiable simulation to provide meaningful learning signals and achieve efficient credit assignment for the policy. This is demonstrated by revealing the existence of gradient highways and inter-agent gradient pathways. However, the issues of gradient explosion and weak supervision in low-density regions are discovered. Our second insight is that these issues can be addressed by applying dual policy regularizations to narrow the function space. Further considering diversity, our third insight is that the behaviors of heterogeneous agents in the dataset can be effectively compressed as a series of prototype vectors for retrieval. These lead to our model-based reinforcement-imitation learning framework with temporally abstracted mixture-of-codebooks (MRIC). MRIC introduces the open-loop model-based imitation learning regularization to stabilize training, and model-based reinforcement learning (RL) regularization to inject domain knowledge. The RL regularization involves differentiable Minkowski-difference-based collision avoidance and projection-based on-road and traffic rule compliance rewards. A dynamic multiplier mechanism is further proposed to eliminate the interference from the regularizations while ensuring their effectiveness. Experimental results using the large-scale Waymo Open Motion Dataset show that MRIC outperforms state-of-the-art baselines on diversity, behavioral realism, and distributional realism, with large margins on some key metrics (e.g., collision rate, minSADE, and time-to-collision JSD).
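The abstract names Minkowski-difference-based collision avoidance but gives no formulation, so the following is only a minimal sketch of the general geometric idea, not the paper's implementation. It relies on a standard fact: two convex shapes overlap exactly when the origin lies inside their Minkowski difference, and the distance from the origin to the boundary of that difference gives the separation gap (outside) or the penetration depth (inside). The rectangular vehicle footprint, the function names, and the hinge penalty with its `margin` value are all assumptions made for illustration. The code is plain Python and is not differentiable as written; a differentiable variant would express the same computation in an autodiff framework.

```python
import math


def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def box_corners(cx, cy, heading, length, width):
    """Corners of an oriented rectangle (hypothetical vehicle footprint)."""
    c, s = math.cos(heading), math.sin(heading)
    hl, hw = length / 2.0, width / 2.0
    return [(cx + c * dx - s * dy, cy + s * dx + c * dy)
            for dx, dy in ((hl, hw), (-hl, hw), (-hl, -hw), (hl, -hw))]


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    ex, ey = b[0] - a[0], b[1] - a[1]
    denom = ex * ex + ey * ey
    t = 0.0
    if denom > 0.0:
        t = max(0.0, min(1.0, ((p[0] - a[0]) * ex + (p[1] - a[1]) * ey) / denom))
    return math.hypot(p[0] - (a[0] + t * ex), p[1] - (a[1] + t * ey))


def signed_distance(poly_a, poly_b):
    """Signed distance between two convex polygons with non-zero area.

    Builds the Minkowski difference A - B as the hull of all pairwise vertex
    differences, then measures the origin against it.
    Positive: separation gap. Negative: penetration depth.
    """
    diff = convex_hull([(a[0] - b[0], a[1] - b[1]) for a in poly_a for b in poly_b])
    origin = (0.0, 0.0)
    edges = list(zip(diff, diff[1:] + diff[:1]))
    dist = min(point_segment_distance(origin, a, b) for a, b in edges)
    # For a counter-clockwise hull, the origin is inside iff it is left of every edge.
    inside = all(cross(a, b, origin) >= 0 for a, b in edges)
    return -dist if inside else dist


def collision_penalty(distance, margin=0.5):
    """Hinge penalty (assumed form): zero once the gap exceeds `margin`."""
    return max(0.0, margin - distance)


if __name__ == "__main__":
    ego = box_corners(0.0, 0.0, 0.0, 4.0, 2.0)
    far_car = box_corners(10.0, 0.0, 0.0, 4.0, 2.0)
    near_car = box_corners(3.0, 0.0, 0.0, 4.0, 2.0)
    print(signed_distance(ego, far_car))   # separated boxes: positive gap
    print(signed_distance(ego, near_car))  # overlapping boxes: negative depth
```

Enumerating all pairwise vertex differences costs O(nm) per pair of polygons, which is acceptable for four-corner footprints; a penalty of this kind would then be summed over agent pairs and time steps.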
