There has been rapid and dramatic progress in learning complex visuo-motor manipulation skills from demonstrations, thanks in part to expressive policy classes built on diffusion- and transformer-based backbones. However, these design choices demand significant data and computational resources and remain far from reliable, particularly in the context of multi-fingered dexterous manipulation. Fundamentally, such policies model skills as reactive mappings and rely on fixed-horizon action chunking to mitigate jitter, creating a rigid trade-off between temporal coherence and reactivity. In this work, we introduce Unified Behavioral Models (UBMs), a framework that represents dexterous skills as coupled dynamical systems capturing how visual features of the environment (visual flow) and proprioceptive states of the robot (action flow) co-evolve. By capturing such behavioral dynamics, UBMs ensure temporal coherence by construction rather than by heuristic averaging. To operationalize these models, we propose Koopman-UBM (K-UBM), a first instantiation of UBMs that leverages Koopman operator theory to learn a unified representation in which the joint flow of latent visual and proprioceptive features is governed by a structured linear system. We show that K-UBM can be viewed as an implicit planner: given an initial condition, it computes the desired robot behavior, together with the resulting flow of visual features, over the entire skill horizon. To enable reactivity, we introduce an online replanning strategy in which the model acts as its own runtime monitor, automatically triggering replanning when predicted and observed visual flow diverge. Across seven simulated and two real-world tasks, we demonstrate that K-UBM matches or exceeds the performance of state-of-the-art baselines while offering faster inference, smooth execution, robustness to occlusions, and flexible replanning.
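To make the core idea concrete, the following is a minimal sketch of the kind of structured linear latent dynamics the abstract describes: a Koopman operator `K` advances a joint latent state (visual plus proprioceptive features), an open-loop rollout serves as the implicit plan, and a divergence check between predicted and observed latents stands in for the runtime monitor that triggers replanning. All function names, dimensions, and the least-squares fitting choice here are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_koopman(Z):
    """Least-squares fit of a linear operator K with Z[:, 1:] ~= K @ Z[:, :-1].
    Z stacks the joint latent (visual + proprioceptive) features column-wise
    over time. This simple pseudoinverse fit is a stand-in for however the
    paper actually learns its structured linear system."""
    X, Y = Z[:, :-1], Z[:, 1:]
    return Y @ np.linalg.pinv(X)

def rollout(K, z0, horizon):
    """Implicit planning: propagate an initial latent state through the
    linear system to obtain the predicted flow over the whole horizon."""
    states = [z0]
    for _ in range(horizon):
        states.append(K @ states[-1])
    return np.stack(states, axis=1)

def needs_replan(z_pred, z_obs, tol=0.5):
    """Runtime monitor (illustrative): trigger replanning when the predicted
    and observed latent flows diverge beyond a tolerance."""
    return np.linalg.norm(z_pred - z_obs) > tol

# Toy demonstration data generated by a stable linear system, so that the
# least-squares fit can recover the dynamics.
d, T = 6, 50
K_true = 0.95 * np.linalg.qr(rng.normal(size=(d, d)))[0]  # contractive rotation
Z = np.empty((d, T))
Z[:, 0] = rng.normal(size=d)
for t in range(1, T):
    Z[:, t] = K_true @ Z[:, t - 1]

K_hat = fit_koopman(Z)
plan = rollout(K_hat, Z[:, 0], horizon=T - 1)
err = np.linalg.norm(plan - Z) / np.linalg.norm(Z)
print(f"relative rollout error: {err:.2e}")
```

Because the latent dynamics are linear, the full-horizon plan is a single matrix-power rollout rather than an iterative sampling procedure, which is consistent with the faster inference the abstract claims; the `needs_replan` check closes the loop by falling back to a fresh rollout from the current observation when the prediction drifts.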