Behavioral cloning (BC) can recover a good policy from abundant expert data, but it may fail when expert data is scarce. This paper considers a setting where, in addition to a small amount of expert data, a supplementary dataset is available that can be collected cheaply from sub-optimal policies. Imitation learning with a supplementary dataset is an emerging practical framework, but its theoretical foundation remains underdeveloped. To advance understanding, we first investigate a direct extension of BC, called NBCU, that learns from the union of all available data. Our analysis shows that, although NBCU suffers a larger imitation gap than BC in the worst case, there exist special cases where NBCU performs as well as or better than BC. This finding implies that noisy data can also be helpful if it is used carefully. We therefore introduce a discriminator-based importance sampling technique that re-weights the supplementary data, yielding the WBCU method. With our newly developed landscape-based analysis, we prove that WBCU can outperform BC under mild conditions. Empirical studies show that WBCU achieves the best performance on both of two challenging tasks where prior state-of-the-art methods fail.
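The contrast between learning from the union of all data and re-weighting it with a discriminator can be illustrated with a toy tabular sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names (`bc_policy`, `density_ratio_weights`), the tiny datasets, and in particular the count-based discriminator, which stands in for a learned one. The weight `d / (1 - d)` relies only on the standard identity that an optimal expert-versus-union discriminator `d` yields the density ratio between the two data distributions; the paper's exact objective may differ.

```python
from collections import Counter, defaultdict


def bc_policy(pairs, weights=None):
    """Weighted maximum-likelihood tabular policy.

    pi(a|s) is proportional to the total weight of (s, a) in the data.
    With weights=None every sample counts once (plain behavioral cloning).
    """
    if weights is None:
        weights = [1.0] * len(pairs)
    mass = defaultdict(lambda: defaultdict(float))
    for (s, a), w in zip(pairs, weights):
        mass[s][a] += w
    policy = {}
    for s, acts in mass.items():
        total = sum(acts.values())
        if total > 0:  # states whose samples all got zero weight are left undefined
            policy[s] = {a: m / total for a, m in acts.items()}
    return policy


def density_ratio_weights(expert, union):
    """Importance weights d/(1-d) for each sample in the union dataset.

    d is a count-based stand-in for a discriminator that separates expert
    data from union data (hypothetical simplification, not a trained model).
    """
    n_e, n_u = Counter(expert), Counter(union)
    weights = []
    for sa in union:
        p_e = n_e[sa] / len(expert)  # empirical expert frequency
        p_u = n_u[sa] / len(union)   # empirical union frequency (always > 0 here)
        d = p_e / (p_e + p_u)        # discriminator output for this pair
        weights.append(d / (1.0 - d))
    return weights


# Hypothetical data: (state, action) pairs.
expert = [(0, "a"), (1, "b")]
supplementary = [(0, "a"), (0, "b"), (0, "b"), (2, "a")]
union = expert + supplementary

bc = bc_policy(expert)                                        # expert data only
nbcu = bc_policy(union)                                       # union, unweighted
wbcu = bc_policy(union, density_ratio_weights(expert, union)) # union, re-weighted

print("BC  :", bc)
print("NBCU:", nbcu)
print("WBCU:", wbcu)
```

In this toy case the unweighted union policy is pulled toward the sub-optimal action in state 0, while the re-weighted policy assigns zero weight to pairs the expert never produced. A learned discriminator over features, rather than exact counts, would be needed for the weights to generalize beyond pairs seen in the expert data.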