Training intelligent agents to navigate highly interactive environments presents significant challenges. While the guided meta reinforcement learning (RL) approach, which first trains a guiding policy that in turn trains the ego agent, has proven effective at improving generalizability across varying levels of interaction, the state-of-the-art method tends to be overly sensitive to extreme cases, impairing the agent's performance in the more common scenarios. This study introduces a novel training framework that integrates guided meta RL with importance sampling (IS) to optimize training distributions for navigating highly interactive driving scenarios, such as T-intersections. Unlike traditional methods, which may underrepresent critical interactions or overemphasize extreme cases during training, our approach strategically shifts the training distribution toward more challenging driving behaviors using IS proposal distributions and applies the importance ratio to de-bias the result. By estimating a naturalistic distribution from real-world datasets and employing a mixture model for iterative training refinement, the framework ensures a balanced focus across common and extreme driving scenarios. Experiments on both a synthetic dataset and T-intersection scenarios from the InD dataset demonstrate not only accelerated training but also improved agent performance under naturalistic conditions, showcasing the efficacy of combining IS with meta RL to train reliable autonomous agents for highly interactive navigation tasks.
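The de-biasing idea in the abstract can be sketched numerically. The toy below is not the paper's method: it assumes a hypothetical one-dimensional "aggressiveness" parameter for the interacting driver, a Gaussian naturalistic distribution p, a proposal q shifted toward more aggressive behavior, and an invented per-scenario reward function. It only illustrates that re-weighting samples from q by the importance ratio p(x)/q(x) recovers an unbiased estimate of performance under the naturalistic distribution p.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def reward(x):
    # Hypothetical per-scenario performance: degrades as the other
    # driver's aggressiveness x grows, floored at zero.
    return max(0.0, 1.0 - 0.3 * max(x, 0.0))

random.seed(0)
n = 200_000

# Naturalistic distribution p (in the paper, estimated from real data)
# and an IS proposal q shifted toward more challenging behavior.
mu_p, mu_q, sigma = 0.0, 1.5, 1.0

# Baseline: plain Monte Carlo estimate of expected reward under p.
mc_estimate = sum(reward(random.gauss(mu_p, sigma)) for _ in range(n)) / n

# Importance sampling: draw hard scenarios from q, then de-bias each
# sample with the importance ratio p(x) / q(x).
total = 0.0
for _ in range(n):
    x = random.gauss(mu_q, sigma)
    w = normal_pdf(x, mu_p, sigma) / normal_pdf(x, mu_q, sigma)
    total += w * reward(x)
is_estimate = total / n

print(mc_estimate, is_estimate)  # the two estimates agree closely
```

The proposal concentrates samples on the aggressive tail that plain sampling from p rarely visits, while the ratio keeps the resulting estimate unbiased with respect to p; the paper's mixture proposal plays the same role over iterative training rounds.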