Many applications of imitation learning require the agent to generate the full distribution of behaviour observed in the training data. For example, to evaluate the safety of autonomous vehicles in simulation, accurate and diverse behaviour models of other road users are paramount. Existing methods that improve this distributional realism typically rely on hierarchical policies, which condition the policy on types, such as goals or personas, that give rise to multi-modal behaviour. However, such methods are often inappropriate for stochastic environments in which the agent must also react to external factors. Because agent types are inferred from the observed future trajectory during training, these environments require that the contributions of internal and external factors to the agent's behaviour be disentangled, so that only internal factors, i.e., those under the agent's control, are encoded in the type. Encoding future information about external factors leads to inappropriate agent reactions at test time, when the future is unknown and types must be drawn independently of the actual future. We formalize this challenge as a distribution shift in the conditional distribution of agent types under environmental stochasticity. We propose Robust Type Conditioning (RTC), which eliminates this shift with adversarial training under randomly sampled types. Experiments on two domains, including the large-scale Waymo Open Motion Dataset, show improved distributional realism while maintaining or improving task performance compared to state-of-the-art baselines.
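The distribution shift described above can be illustrated with a small toy simulation. This is a sketch under stated assumptions, not the RTC method: the driving scenario, the names `simulate`, `type_shift`, `cautious` and `lead_brakes`, and all probabilities are hypothetical and chosen only for illustration. A type read directly off the observed future depends on the external factor, whereas a type that encodes only the internal factor does not.

```python
import random


def simulate(n=10_000, seed=0):
    """Toy driving scenario; every name and probability here is illustrative."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        cautious = rng.random() < 0.5      # internal factor: the agent's own disposition
        lead_brakes = rng.random() < 0.5   # external factor: environment stochasticity
        # Observed future: the agent slows down if it is cautious or the lead car brakes.
        slows_down = cautious or lead_brakes
        leaky_type = slows_down   # type read directly off the observed future
        clean_type = cautious     # type that encodes only the internal factor
        rows.append((lead_brakes, leaky_type, clean_type))
    return rows


def type_shift(rows, idx):
    """|P(type=1 | lead brakes) - P(type=1 | lead does not brake)| for column idx."""
    def p_type(external):
        selected = [r[idx] for r in rows if r[0] == external]
        return sum(selected) / len(selected)
    return abs(p_type(True) - p_type(False))


if __name__ == "__main__":
    rows = simulate()
    print("leaky type shift:", round(type_shift(rows, 1), 3))
    print("clean type shift:", round(type_shift(rows, 2), 3))
```

In this toy setup the leaky type has an expected gap of 0.5 between the two external conditions, while the clean type has an expected gap of 0. At test time, types are sampled without access to the future and are therefore independent of `lead_brakes`, which matches only the clean case. A policy trained on the leaky type would see a different conditional type distribution at test time than it saw during training.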