Mimic Intent, Not Just Trajectories

While imitation learning (IL) has achieved impressive success in dexterous manipulation through generative modeling and pretraining, state-of-the-art approaches such as Vision-Language-Action (VLA) models still struggle to adapt to environmental changes and to transfer skills. We argue that this stems from mimicking raw trajectories without understanding the underlying intent. To address this, we propose explicitly disentangling behavior intent from execution details in end-to-end IL: \textit{``Mimic Intent, Not Just Trajectories'' (MINT)}. We achieve this via \textit{multi-scale frequency-space tokenization}, which enforces a spectral decomposition of the action-chunk representation. We learn action tokens with a multi-scale, coarse-to-fine structure, forcing the coarsest token to capture low-frequency global structure and the finer tokens to encode high-frequency details. This yields an abstract \textit{Intent token} that facilitates planning and transfer, and multi-scale \textit{Execution tokens} that enable precise adaptation to environmental dynamics. Building on this hierarchy, our policy generates trajectories through \textit{next-scale autoregression}, performing progressive \textit{intent-to-execution reasoning} that boosts learning efficiency and generalization. Crucially, this disentanglement enables \textit{one-shot transfer} of skills simply by injecting the Intent token from a demonstration into the autoregressive generation process. Experiments on several manipulation benchmarks and on a real robot demonstrate state-of-the-art success rates, superior inference efficiency, robust generalization under disturbances, and effective one-shot transfer.
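To make the spectral decomposition concrete, the following is a minimal sketch of a coarse-to-fine frequency split of one action chunk. It is not the MINT tokenizer: the abstract does not name the transform or the band boundaries, so the discrete cosine transform (DCT), the fixed `band_edges`, and the helper names `dct_matrix`, `split_into_scales`, and `inject_intent` are all assumptions made for illustration. The learned tokens are replaced here by plain band-limited signals.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis; rows are cosine modes ordered low to high frequency."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    basis = np.cos(np.pi * (t + 0.5) * k / n) * np.sqrt(2.0 / n)
    basis[0] /= np.sqrt(2.0)  # rescale the constant mode so the rows are orthonormal
    return basis

def split_into_scales(chunk, band_edges):
    """Split an action chunk of shape (T, D) into band-limited parts that sum to the chunk.

    band_edges is an assumed, hand-picked partition of the T frequency indices,
    e.g. [0, 2, 6, T]; the first band plays the role of the coarsest scale.
    """
    basis = dct_matrix(chunk.shape[0])
    coeffs = basis @ chunk  # per-dimension frequency coefficients
    scales = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        masked = np.zeros_like(coeffs)
        masked[lo:hi] = coeffs[lo:hi]    # keep only this frequency band
        scales.append(basis.T @ masked)  # back to the time domain
    return scales

def inject_intent(demo_scales, own_scales):
    """Toy analogue of intent injection: take the coarsest band from a demonstration
    and keep the finer bands of the current chunk."""
    return demo_scales[0] + sum(own_scales[1:])

# Hypothetical 16-step, 2-dimensional action chunk: smooth motion plus small jitter.
rng = np.random.default_rng(0)
T = 16
t = np.linspace(0.0, 1.0, T)[:, None]
chunk = np.hstack([t, t ** 2]) + 0.01 * rng.standard_normal((T, 2))

scales = split_into_scales(chunk, [0, 2, 6, T])
intent, execution = scales[0], scales[1:]
assert np.allclose(intent + sum(execution), chunk)  # the bands sum back to the chunk
```

In this sketch the first band carries the slow, global shape of the motion and the remaining bands carry progressively finer corrections, which mirrors the Intent/Execution split described above. A learned tokenizer would presumably quantize each scale into discrete tokens rather than keep raw band-limited signals.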
