We propose a theoretical framework for studying behavior cloning of stochastic, non-Markovian, potentially multi-modal (i.e., ``complex'') expert demonstrations in nonlinear dynamical systems. Our framework invokes low-level controllers, either learned or implicit in position-command control, to stabilize imitation around expert demonstrations. We show that with (a) a suitable low-level stability guarantee and (b) a stochastic continuity property of the learned policy that we call ``total variation continuity'' (TVC), an imitator that accurately estimates actions on the demonstrator's state distribution closely matches the demonstrator's distribution over entire trajectories. We then show that TVC can be ensured with minimal degradation of accuracy by combining a popular data-augmentation regimen with a novel algorithmic trick: adding augmentation noise at execution time. We instantiate our guarantees for policies parameterized by diffusion models and prove that if the learner accurately estimates the score of the (noise-augmented) expert policy, then the distribution of imitator trajectories is close to the demonstrator distribution in a natural optimal transport distance. Our analysis constructs intricate couplings between noise-augmented trajectories, a technique that may be of independent interest. We conclude by empirically validating our algorithmic recommendations.
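To make the algorithmic recommendation concrete, the following is a minimal sketch of noise augmentation applied both to the training data and at execution time. It is an illustration under stated assumptions rather than the authors' implementation: the noise is assumed to be isotropic Gaussian with scale `sigma`, augmented states are assumed to be paired with the original expert actions, and all function names (`augment_states`, `make_training_pairs`, `act_with_execution_noise`) are hypothetical.

```python
import numpy as np


def augment_states(states, sigma, rng):
    """Perturb states with isotropic Gaussian noise (assumed noise model)."""
    states = np.asarray(states, dtype=float)
    return states + sigma * rng.standard_normal(states.shape)


def make_training_pairs(expert_states, expert_actions, sigma, rng):
    """Data augmentation: pair noised expert states with the original actions.

    The learner would then be fit on these (noisy state, action) pairs.
    """
    return augment_states(expert_states, sigma, rng), expert_actions


def act_with_execution_noise(policy, state, sigma, rng):
    """Execution-time trick: noise the observed state before querying the policy.

    `policy` is any callable mapping a state array to an action (hypothetical).
    Using the same noise scale as in training is an assumption of this sketch.
    """
    noisy_state = augment_states(state, sigma, rng)
    return policy(noisy_state)
```

In this sketch the policy is queried on states drawn from the same smoothed distribution it was trained on, which is the intuition behind adding noise at execution time; the choice of `sigma` trades off smoothing against accuracy and is left open here.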