We propose a theoretical framework for studying behavior cloning of complex expert demonstrations using generative modeling. Our framework invokes low-level controllers (either learned or implicit in position-command control) to stabilize imitation around expert demonstrations. We show that with (a) a suitable low-level stability guarantee and (b) a powerful enough generative model as our imitation learner, pure supervised behavior cloning can generate trajectories matching the per-time-step distribution of essentially arbitrary expert trajectories under an optimal transport cost. Our analysis relies on a stochastic continuity property of the learned policy, which we call "total variation continuity" (TVC). We then show that TVC can be ensured with minimal degradation of accuracy by combining a popular data-augmentation regimen with a novel algorithmic trick: adding augmentation noise at execution time. We instantiate our guarantees for policies parameterized by diffusion models and prove that if the learner accurately estimates the score of the (noise-augmented) expert policy, then the distribution of imitator trajectories is close to the demonstrator distribution in a natural optimal transport distance. Our analysis constructs intricate couplings between noise-augmented trajectories, a technique that may be of independent interest. We conclude by empirically validating our algorithmic recommendations and by discussing implications for future research toward better behavior cloning with generative modeling.
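The algorithmic trick described above (noise augmentation during training, plus the same noise at execution time) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the isotropic Gaussian noise with scale `sigma`, and the `policy` and `step` callables are all hypothetical, and the low-level stabilizing controllers and the generative (diffusion) policy are abstracted away.

```python
import numpy as np


def augment_observations(observations, sigma, rng):
    """Training-time augmentation (assumed form): perturb the expert
    observations with Gaussian noise; the action labels are left unchanged."""
    observations = np.asarray(observations, dtype=float)
    return observations + sigma * rng.standard_normal(observations.shape)


def rollout_with_execution_noise(policy, step, initial_state, horizon, sigma, rng):
    """Roll out `policy`, adding noise of the same scale to every
    observation before the policy is queried (execution-time noise).

    `policy` maps a (noisy) state to an action and `step` maps
    (state, action) to the next state; both are placeholders here.
    """
    state = np.asarray(initial_state, dtype=float)
    states = [state]
    for _ in range(horizon):
        # The policy only ever sees the noised state; the true state evolves cleanly.
        noisy_state = state + sigma * rng.standard_normal(state.shape)
        action = policy(noisy_state)
        state = step(state, action)
        states.append(state)
    return np.stack(states)
```

In this sketch the policy is queried on inputs drawn from the same smoothed distribution it was trained on, which is the intuition behind using noise at execution time; setting `sigma=0` recovers a plain behavior-cloning rollout.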