Although deep reinforcement learning (DRL) has shown promising results for autonomous navigation in interactive traffic scenarios, existing work typically adopts a fixed behavior policy to control social vehicles in the training environment. The learned driving policy may therefore overfit to that environment and struggle to interact with vehicles whose behaviors differ from those seen during training. In this work, we introduce an efficient method to train diverse driving policies for social vehicles as a single meta-policy. By randomizing the interaction-based reward functions of social vehicles, we generate diverse objectives and efficiently train the meta-policy through guiding policies that achieve specific objectives. We further propose a training strategy that enhances the robustness of the ego vehicle's driving policy by training it in an environment where social vehicles are controlled by the learned meta-policy. Our method successfully learns an ego driving policy that generalizes well to unseen situations in which social agents exhibit out-of-distribution (OOD) behaviors, evaluated in a challenging uncontrolled T-intersection scenario.
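The idea of randomizing interaction-based reward functions and conditioning one meta-policy on the sampled objective can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the reward form, so the `Objective` class, the linear weighting of own and other vehicles' progress, the sampling range, and all function names here are assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Objective:
    """Hypothetical objective for one social vehicle (assumed form)."""
    w_self: float   # weight on the vehicle's own progress
    w_other: float  # weight on the progress of the vehicle it interacts with


def sample_objective(rng: random.Random) -> Objective:
    """Draw a random objective; the range [-1, 1] is an assumption."""
    return Objective(w_self=1.0, w_other=rng.uniform(-1.0, 1.0))


def interaction_reward(obj: Objective, own_progress: float,
                       other_progress: float) -> float:
    """Assumed linear reward: negative w_other yields competitive behavior,
    positive w_other yields cooperative behavior."""
    return obj.w_self * own_progress + obj.w_other * other_progress


def meta_policy_input(observation: List[float], obj: Objective) -> List[float]:
    """Append the objective to the observation so that a single policy
    network can represent many different behaviors."""
    return observation + [obj.w_self, obj.w_other]
```

Under these assumptions, a new objective would be sampled for each social vehicle at the start of every episode, and the ego policy would then be trained against social vehicles driven by the objective-conditioned meta-policy.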