One bottleneck in training autonomous vehicle (AV) agents is the variability of training environments. Because learning an optimal policy for an unseen environment is often costly and requires substantial data collection, training the agent on every environment or task the AV may encounter becomes computationally intractable. This paper introduces a zero-shot filtering approach that interpolates policies learned from past experiences to generalize to unseen ones. We use an experience kernel to correlate environments, and then exploit these correlations to produce policies for new tasks or environments from previously learned policies. We demonstrate our methods on an autonomous vehicle driving through T-intersections with different characteristics, where its behavior is modeled as a partially observable Markov decision process (POMDP). We first construct compact representations of learned policies for POMDPs with unknown transition functions, given a dataset of sequential actions and observations. We then filter the parameterized policies of previously visited environments to generate policies for new, unseen environments. We demonstrate our approaches on both an actual AV and a high-fidelity simulator. Results indicate that our experience filter offers a fast, low-effort, and near-optimal way to create policies for tasks or environments never seen before. Furthermore, the generated policies outperform the policy learned from all the data collected in past environments, suggesting that correlations among environments can be exploited and irrelevant environments filtered out.
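The filtering idea described in the abstract, weighting previously learned policies by how strongly their environments correlate with a new one, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the radial basis function used as the experience kernel, the numeric environment descriptors, the `length_scale` parameter, the representation of a policy as a vector of action probabilities, and the function names `experience_kernel` and `filter_policies`.

```python
import numpy as np


def experience_kernel(env_a, env_b, length_scale=1.0):
    """Hypothetical similarity between two environment descriptors.

    Assumption: an RBF kernel on numeric descriptors; the paper's
    actual experience kernel may be defined differently.
    """
    diff = np.asarray(env_a, dtype=float) - np.asarray(env_b, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * length_scale ** 2)))


def filter_policies(new_env, past_envs, past_policies, length_scale=1.0):
    """Interpolate past policy parameters for an unseen environment.

    Returns the normalized kernel weights and their weighted average
    of the past policy parameter vectors.
    """
    weights = np.array(
        [experience_kernel(new_env, env, length_scale) for env in past_envs]
    )
    total = weights.sum()
    if total == 0.0:
        # Every past environment is numerically uncorrelated: fall back
        # to a uniform average (an illustrative choice, not the paper's).
        weights = np.full(len(past_envs), 1.0 / len(past_envs))
    else:
        weights = weights / total
    policies = np.asarray(past_policies, dtype=float)
    return weights, np.tensordot(weights, policies, axes=1)


# Illustrative data: three past environments described by one made-up
# feature, each with a policy given as probabilities over two actions.
past_envs = [[0.0], [1.0], [5.0]]
past_policies = [[0.9, 0.1], [0.6, 0.4], [0.1, 0.9]]

weights, new_policy = filter_policies([0.9], past_envs, past_policies)
print("weights:", weights)
print("interpolated policy:", new_policy)
```

Because the weights are non-negative and sum to one, the interpolated vector is a convex combination of the past policies and therefore remains a valid probability distribution. With a small `length_scale`, distant environments receive weights close to zero, which mirrors the abstract's point that irrelevant experiences can be filtered out.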