The heterogeneity between high-level vision-language understanding and low-level action control remains a fundamental challenge in robotic manipulation. Although recent methods have advanced task-specific action alignment, they often struggle to generate robust and accurate actions for novel or semantically related tasks. To address this, we propose the Language-Grounded Decoupled Action Representation (LaDA) framework, which leverages natural language as a semantic bridge to connect perception and control. LaDA introduces a fine-grained intermediate layer of three interpretable action primitives (translation, rotation, and gripper control), providing explicit semantic structure for low-level actions. It further employs a semantic-guided soft-label contrastive learning objective to align similar action primitives across tasks, enhancing generalization and motion consistency. An adaptive weighting strategy, inspired by curriculum learning, dynamically balances the contrastive and imitation objectives for stable and effective training. Extensive experiments on simulated benchmarks (LIBERO and MimicGen) and real-world demonstrations show that LaDA achieves strong performance and generalizes effectively to unseen or related tasks.
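To make the two training objectives concrete, the following is a minimal sketch of what a soft-label contrastive loss with a curriculum-style adaptive weight could look like. The exact loss formulation, temperature, target construction, and schedule used by LaDA are not specified here; the function names, the row-normalized soft-target matrix, and the linear warmup ramp are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def soft_label_contrastive_loss(embeddings, soft_targets, temperature=0.1):
    """Cross-entropy between the softmax over pairwise cosine similarities
    and a soft target distribution (e.g. derived from language-embedding
    similarity of action-primitive descriptions). soft_targets is assumed
    row-normalized with a zero diagonal."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    logits = (z @ z.T) / temperature
    np.fill_diagonal(logits, -np.inf)  # exclude self-pairs from the softmax
    m = logits.max(axis=1, keepdims=True)  # stable log-softmax
    log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    # Avoid 0 * (-inf) on the masked diagonal.
    weighted = np.where(soft_targets > 0, soft_targets * log_probs, 0.0)
    return -weighted.sum(axis=1).mean()

def contrastive_weight(step, warmup_steps=1000, max_weight=0.5):
    """Curriculum-style linear ramp: imitation dominates early training,
    and the contrastive term is phased in as features stabilize
    (hypothetical schedule)."""
    return max_weight * min(1.0, step / warmup_steps)

# Toy usage: four action-primitive embeddings and soft cross-task targets.
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
targets = np.array([[0.0, 0.8, 0.2, 0.0],
                    [0.8, 0.0, 0.1, 0.1],
                    [0.2, 0.1, 0.0, 0.7],
                    [0.0, 0.2, 0.8, 0.0]])
l_con = soft_label_contrastive_loss(z, targets)
w = contrastive_weight(step=500)  # total loss would be l_bc + w * l_con
```

The combined objective would then be the imitation loss plus `contrastive_weight(step)` times the contrastive term, so early gradients come almost entirely from imitation and the alignment pressure grows gradually.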