HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation

Shuanghao Bai, Meng Li, Xinyuan Lv, Jiawei Wang, Xinhua Wang, Fei Liao, Chengkai Hou, Langzhe Gu, Wanqi Zhou, Kun Wu, Ziluo Ding, Zhiyuan Xu, Lei Sun, Shanghang Zhang, Zhengping Che, Jian Tang, Badong Chen

Source: arXiv. Project page: https://hex-humanoid.github.io/

Humans achieve complex manipulation through coordinated whole-body control, whereas most Vision-Language-Action (VLA) models treat robot body parts largely independently, making high-DoF humanoid control challenging and often unstable. We present HEX, a state-centric framework for coordinated manipulation on full-sized bipedal humanoid robots. HEX introduces a humanoid-aligned universal state representation for scalable learning across heterogeneous embodiments, and incorporates a Mixture-of-Experts Unified Proprioceptive Predictor to model whole-body coordination and temporal motion dynamics from large-scale multi-embodiment trajectory data. To efficiently capture temporal visual context, HEX uses lightweight history tokens to summarize past observations, avoiding repeated encoding of historical images during inference. It further employs a residual-gated fusion mechanism with a flow-matching action head to adaptively integrate visual-language cues with proprioceptive dynamics for action generation. Experiments on real-world humanoid manipulation tasks show that HEX achieves state-of-the-art performance in task success rate and generalization, particularly in fast-reaction and long-horizon scenarios.
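
The abstract does not spell out how the residual-gated fusion is computed, so the following is only a minimal sketch of one common form of the idea, not the authors' implementation. It assumes the visual-language feature acts as the residual path and a per-dimension sigmoid gate scales the proprioceptive feature before it is added. The function name `residual_gated_fusion` and the `gate_logits` argument are hypothetical; in a trained model the gate logits would come from a learned layer rather than being passed in directly.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def residual_gated_fusion(
    vl: Sequence[float],
    proprio: Sequence[float],
    gate_logits: Sequence[float],
) -> List[float]:
    """Hypothetical fusion: fused_i = vl_i + sigmoid(gate_logit_i) * proprio_i.

    vl          -- visual-language feature (kept unchanged as the residual path)
    proprio     -- proprioceptive-dynamics feature to be mixed in
    gate_logits -- per-dimension logits; assumed to come from a learned layer
    """
    if not (len(vl) == len(proprio) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    return [v + sigmoid(g) * p for v, p, g in zip(vl, proprio, gate_logits)]


if __name__ == "__main__":
    # A gate logit of 0 gives a gate of 0.5, so half of each proprio value is added.
    print(residual_gated_fusion([1.0, 2.0], [2.0, 4.0], [0.0, 0.0]))
```

With a strongly negative gate logit the gate closes and the output falls back to the visual-language feature alone, which is the property that makes the residual form attractive: the proprioceptive branch can be suppressed without disturbing the main path.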

