HumanoidArena: Benchmarking Egocentric Hierarchical Whole-body Learning

Taowen Wang, Zikang Xie, Bin Yang, Yunheng Wang, Zizhao Yuan, Yuetong Fang, Yixiao Feng, Yichi Wang, Xingyu Chen, Haodong Chen, Qiwei Wu, Weisheng Xu, Lihan Chen, Lusong Li, Zecui Zeng, Renjing Xu

From arXiv: 29 pages, 13 figures, 10 tables

Humanoid robots promise whole-body interaction in human-centered environments, but scalable policy learning remains difficult because task-level decision-making and whole-body dynamic execution are tightly coupled. A practical solution is hierarchical control, in which a high-level policy predicts intermediate whole-body actions and low-level general motion trackers (GMTs) execute them as stable humanoid motion. However, existing benchmarks rarely evaluate the policy-tracker interface itself, leaving open whether intermediate whole-body actions are executable, robust under task distribution shifts, and transferable across different GMT backends. We introduce HumanoidArena, a simulation-first benchmark for egocentric hierarchical whole-body learning. The benchmark formulates policy learning as a hierarchical decision-making problem: a high-level policy converts egocentric vision, proprioception, and instructions into a compact whole-body action, which is subsequently executed by a low-level GMT. Instead of treating the legs as planar transport tools, HumanoidArena emphasizes interactions in which lower-body coordination is structurally necessary for task completion. We therefore design seven leg-critical human-object interaction (HOI) and human-scene interaction (HSI) tasks in which success requires foot placement, balance maintenance, posture adjustment, and whole-body reorientation. To further diagnose the hierarchical system, we evaluate policies from two complementary perspectives: perturbation-conditioned generalization and GMT-conditioned transfer. Experiments show that hierarchical control enables learned policies to solve diverse leg-critical interactions, but performance is strongly tracker-conditioned and cross-GMT transfer remains fragile. These results position HumanoidArena as a benchmark for studying transferable intermediate action representations and scalable egocentric whole-body policy learning.
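The policy-tracker interface and the GMT-conditioned transfer evaluation described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not specify an API, so the names `Observation`, `hierarchical_step`, `evaluate_across_trackers`, and the toy policy and trackers are hypothetical and are not the benchmark's actual code. The sketch only shows the shape of the idea, namely one fixed high-level policy whose intermediate actions are executed by interchangeable low-level trackers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Observation:
    """Hypothetical container for the three policy inputs named in the abstract."""
    egocentric_image: List[float]  # placeholder for a camera frame
    proprioception: List[float]    # e.g. joint positions
    instruction: str               # task instruction


# Hypothetical type aliases for the two levels of the hierarchy.
WholeBodyAction = List[float]
HighLevelPolicy = Callable[[Observation], WholeBodyAction]
MotionTracker = Callable[[WholeBodyAction, List[float]], List[float]]


def hierarchical_step(policy: HighLevelPolicy, tracker: MotionTracker,
                      obs: Observation) -> List[float]:
    """One control step: the policy proposes an intermediate action,
    and the tracker turns it into low-level targets."""
    intermediate = policy(obs)
    return tracker(intermediate, obs.proprioception)


def evaluate_across_trackers(policy: HighLevelPolicy,
                             trackers: Dict[str, MotionTracker],
                             observations: List[Observation],
                             is_success: Callable[[List[float]], bool]) -> Dict[str, float]:
    """Success rate of one fixed policy under each tracker backend."""
    rates = {}
    for name, tracker in trackers.items():
        outcomes = [is_success(hierarchical_step(policy, tracker, obs))
                    for obs in observations]
        rates[name] = sum(outcomes) / len(outcomes)
    return rates


# Toy stand-ins (assumptions, not real GMTs or a learned policy).
def toy_policy(obs: Observation) -> WholeBodyAction:
    # Ask for half of the current joint values.
    return [0.5 * q for q in obs.proprioception]


def exact_tracker(action: WholeBodyAction, proprio: List[float]) -> List[float]:
    # Follows the intermediate action exactly.
    return list(action)


def laggy_tracker(action: WholeBodyAction, proprio: List[float]) -> List[float]:
    # Only moves halfway from the current state toward the requested action.
    return [0.5 * a + 0.5 * q for a, q in zip(action, proprio)]
```

In this toy setup the same policy can succeed under one tracker and fail under another, which is the kind of tracker-conditioned behavior the abstract reports; the success criterion passed as `is_success` is likewise a placeholder for a real task check.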
