Humans achieve complex manipulation through coordinated whole-body control, whereas most Vision-Language-Action (VLA) models treat robot body parts largely independently, making high-DoF humanoid control challenging and often unstable. We present HEX, a state-centric framework for coordinated manipulation on full-sized bipedal humanoid robots. HEX introduces a humanoid-aligned universal state representation for scalable learning across heterogeneous embodiments, and incorporates a Mixture-of-Experts Unified Proprioceptive Predictor to model whole-body coordination and temporal motion dynamics from large-scale multi-embodiment trajectory data. To efficiently capture temporal visual context, HEX uses lightweight history tokens to summarize past observations, avoiding repeated encoding of historical images during inference. It further employs a residual-gated fusion mechanism with a flow-matching action head to adaptively integrate visual-language cues with proprioceptive dynamics for action generation. Experiments on real-world humanoid manipulation tasks show that HEX achieves state-of-the-art performance in task success rate and generalization, particularly in fast-reaction and long-horizon scenarios.
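As a rough illustration of the residual-gated fusion idea mentioned in the abstract, the sketch below adds a gated projection of proprioceptive features onto visual-language features. The abstract does not specify the formulation, so the gating form (a sigmoid over the concatenated features), the function name `residual_gated_fusion`, the weight names `w_gate` and `w_proj`, and all dimensions are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def residual_gated_fusion(vl, prop, w_gate, w_proj):
    """Hypothetical residual-gated fusion (assumed form, not from the paper).

    vl:     visual-language feature vector, shape (d,)
    prop:   proprioceptive feature vector, shape (p,)
    w_gate: gate weights, shape (d, d + p)
    w_proj: projection of proprioception into the vl space, shape (d, p)
    """
    # The gate sees both modalities and yields a per-dimension weight in (0, 1).
    gate = sigmoid(w_gate @ np.concatenate([vl, prop]))
    # Residual form: the visual-language features pass through unchanged,
    # and the gated proprioceptive projection is added on top.
    return vl + gate * (w_proj @ prop)


# Toy example with random features and weights (all values are placeholders).
rng = np.random.default_rng(0)
d, p = 8, 4
vl = rng.normal(size=d)
prop = rng.normal(size=p)
w_gate = rng.normal(size=(d, d + p))
w_proj = rng.normal(size=(d, p))

fused = residual_gated_fusion(vl, prop, w_gate, w_proj)
```

With this residual form, a gate near zero leaves the visual-language features untouched, so the proprioceptive pathway can only adjust, never replace, the visual-language signal. In a full model the fused vector would condition the action head; that step is omitted here.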