We propose a theoretical framework for studying the imitation of stochastic, non-Markovian, potentially multi-modal (i.e., "complex") expert demonstrations in nonlinear dynamical systems. Our framework invokes low-level controllers, either learned or implicit in position-command control, to stabilize imitation policies around expert demonstrations. We show that with (a) a suitable low-level stability guarantee and (b) a stochastic continuity property of the learned policy that we call "total variation continuity" (TVC), an imitator that accurately estimates actions on the demonstrator's state distribution closely matches the demonstrator's distribution over entire trajectories. We then show that TVC can be ensured with minimal degradation of accuracy by combining a popular data-augmentation regimen with a novel algorithmic trick: adding augmentation noise at execution time. We instantiate our guarantees for policies parameterized by diffusion models and prove that if the learner accurately estimates the score of the (noise-augmented) expert policy, then the distribution of imitator trajectories is close to the demonstrator distribution in a natural optimal transport distance. Our analysis constructs intricate couplings between noise-augmented trajectories, a technique that may be of independent interest. We conclude by empirically validating our algorithmic recommendations.
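The algorithmic recommendation, noise augmentation during training combined with the same noise injected at execution time, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify the noise distribution or the data layout, so the sketch uses i.i.d. Gaussian noise on states, leaves expert actions unchanged, and treats the learned policy as an arbitrary callable. The names `augment`, `make_training_pairs`, and `act` are hypothetical.

```python
import random


def augment(state, sigma, rng):
    """Return a copy of `state` with i.i.d. Gaussian noise of scale `sigma` added.

    Assumption: Gaussian noise; the abstract does not name the noise distribution.
    """
    return [s + rng.gauss(0.0, sigma) for s in state]


def make_training_pairs(demos, sigma, rng):
    """Training-time augmentation: noise each expert state, keep the expert action.

    `demos` is a list of (state, action) pairs (a hypothetical data layout).
    """
    return [(augment(state, sigma, rng), action) for state, action in demos]


def act(policy, state, sigma, rng):
    """Execution-time step: noise the observed state before querying the policy.

    `policy` is any callable mapping a (noised) state to an action.
    """
    return policy(augment(state, sigma, rng))
```

Under these assumptions, a policy fit on the output of `make_training_pairs` would be queried through `act` with the same `sigma`, so that the states it sees at execution time are noised in the same way as the states it was trained on.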