While vision-language-action (VLA) models have advanced generalist robotic learning, cross-embodiment transfer remains challenging due to kinematic heterogeneity and the high cost of collecting enough real-world demonstrations to support fine-tuning. Existing cross-embodiment policies typically rely on shared-private architectures, which suffer from the limited capacity of their private parameters and lack explicit adaptation mechanisms. To address these limitations, we introduce MOTIF, a framework for efficient few-shot cross-embodiment transfer that decouples embodiment-agnostic spatiotemporal patterns, termed action motifs, from heterogeneous action data. Specifically, MOTIF first learns unified motifs via vector quantization with progress-aware alignment and embodiment-adversarial constraints to ensure temporal and cross-embodiment consistency. We then design a lightweight predictor that infers these motifs from real-time inputs to guide a flow-matching policy, fusing them with robot-specific states to enable action generation on new embodiments. Evaluations across both simulation and real-world environments validate the superiority of MOTIF, which significantly outperforms strong baselines in few-shot transfer scenarios by 6.5% in simulation and 43.7% in real-world settings. Code is available at https://github.com/buduz/MOTIF.
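To make the pipeline concrete, the following is a minimal, hypothetical sketch of the core quantize-then-condition step: an action-chunk embedding is vector-quantized to its nearest codebook entry (a unified "motif"), which is then fused with the robot-specific state to condition the downstream policy. All names, dimensions, and the random encoder output here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical codebook of K unified action motifs, each a D-dim embedding.
# In MOTIF this codebook is learned with progress-aware alignment and
# embodiment-adversarial constraints; here it is random for illustration.
K, D = 8, 4
codebook = rng.normal(size=(K, D))

def quantize(z: np.ndarray) -> tuple[int, np.ndarray]:
    """Vector-quantize embedding z to its nearest motif under L2 distance."""
    dists = np.linalg.norm(codebook - z, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

# A latent action-chunk embedding, standing in for a learned encoder's output.
z = rng.normal(size=D)
idx, motif = quantize(z)

# The flow-matching policy would fuse the motif with robot-specific state;
# concatenation is used here as a placeholder for that fusion step.
robot_state = rng.normal(size=3)
conditioning = np.concatenate([motif, robot_state])
```

At inference time on a new embodiment, only the lightweight predictor and the state-fusion pathway depend on the robot; the motif codebook itself is shared across embodiments.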