Humanoid robot loco-manipulation remains constrained by the semantic-physical gap. Current methods face three limitations: low sample efficiency in reinforcement learning, poor generalization in imitation learning, and physical inconsistency in vision-language models (VLMs). We propose MetaWorld, a hierarchical world model that integrates semantic planning and physical control via expert policy transfer. The framework decouples tasks into a VLM-driven semantic layer and a latent dynamics model operating in a compact state space. Our dynamic expert selection and motion prior fusion mechanism leverages a pre-trained multi-expert policy library as transferable knowledge, enabling efficient online adaptation through a two-stage framework. VLMs serve as semantic interfaces, mapping instructions to executable skills and bypassing the symbol grounding problem. Experiments on Humanoid-Bench show that MetaWorld outperforms world model-based RL methods in task completion and motion coherence. Our code is available at https://anonymous.4open.science/r/metaworld-2BF4/