We propose a theoretical framework for studying the imitation of stochastic, non-Markovian, potentially multi-modal (i.e., "complex") expert demonstrations in nonlinear dynamical systems. Our framework invokes low-level controllers, either learned or implicit in position-command control, to stabilize imitation policies around expert demonstrations. We show that with (a) a suitable low-level stability guarantee and (b) a stochastic continuity property of the learned policy we call "total variation continuity" (TVC), an imitator that accurately estimates actions on the demonstrator's state distribution closely matches the demonstrator's distribution over entire trajectories. We then show that TVC can be ensured with minimal degradation of accuracy by combining a popular data-augmentation regimen with a novel algorithmic trick: adding augmentation noise at execution time. We instantiate our guarantees for policies parameterized by diffusion models and prove that if the learner accurately estimates the score of the (noise-augmented) expert policy, then the distribution of imitator trajectories is close to the demonstrator distribution in a natural optimal transport distance. Our analysis constructs intricate couplings between noise-augmented trajectories, a technique that may be of independent interest. We conclude by empirically validating our algorithmic recommendations.
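To make the algorithmic recommendation concrete, the following is a minimal sketch of noise augmentation applied both at training time and at execution time. It is an illustration under stated assumptions, not the implementation analyzed in the paper: the Gaussian noise model, the scale `sigma`, the toy expert, and the helper names `augment_states`, `fit_linear_policy`, and `act` are all hypothetical, and a least-squares linear map stands in for the diffusion-model policy.

```python
import numpy as np


def augment_states(states, sigma, rng):
    """Perturb each demonstrated state with isotropic Gaussian noise (assumed noise model)."""
    return states + sigma * rng.standard_normal(states.shape)


def fit_linear_policy(states, actions):
    """Least-squares fit of actions ~ [states, 1] @ W.

    A stand-in for any learner; the paper's policies are diffusion models.
    """
    features = np.hstack([states, np.ones((states.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(features, actions, rcond=None)
    return weights


def act(weights, state, sigma, rng):
    """Query the policy on a freshly noised state, i.e. add augmentation noise at execution time."""
    noisy_state = state + sigma * rng.standard_normal(state.shape)
    return np.append(noisy_state, 1.0) @ weights


rng = np.random.default_rng(0)
sigma = 0.1  # hypothetical augmentation scale

# Toy expert demonstrations: 256 states in R^3 with a linear expert action.
states = rng.standard_normal((256, 3))
actions = states @ np.array([[1.0], [-0.5], [0.25]])

# Train on noise-augmented states paired with the original expert actions.
weights = fit_linear_policy(augment_states(states, sigma, rng), actions)

# At execution time, inject noise of the same scale before querying the policy.
action = act(weights, states[0], sigma, rng)
```

The point of the sketch is the symmetry between the two stages: the policy is queried on states drawn from the same noised distribution it was trained on, which is the mechanism the abstract credits with ensuring TVC.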