Recent advances in multimodal large language models (MLLMs) have opened new opportunities for embodied intelligence, enabling multimodal understanding, reasoning, and interaction, as well as continuous spatial decision-making. Nevertheless, current MLLM-based embodied systems face two critical limitations. First, the Geometric Adaptability Gap: models trained solely on 2D inputs or with hard-coded 3D geometry injection suffer from either insufficient spatial information or restricted 2D generalization, leading to poor adaptability across tasks with diverse spatial demands. Second, the Embodiment Constraint Gap: prior work often neglects the physical constraints and capacities of real robots, resulting in task plans that are theoretically valid but practically infeasible. To address these gaps, we introduce OmniEVA -- an embodied versatile planner that enables advanced embodied reasoning and task planning through two pivotal innovations: (1) a Task-Adaptive 3D Grounding mechanism, which introduces a gated router that selectively regulates 3D fusion based on contextual requirements, enabling context-aware 3D grounding for diverse embodied tasks; and (2) an Embodiment-Aware Reasoning framework that jointly incorporates task goals and embodiment constraints into the reasoning loop, yielding planning decisions that are both goal-directed and executable. Extensive experiments demonstrate that OmniEVA not only achieves state-of-the-art general embodied reasoning performance but also performs strongly across a wide range of downstream scenarios. Evaluations on a suite of proposed embodied benchmarks, covering both primitive and composite tasks, confirm its robust and versatile planning capabilities. Project page: https://omnieva.github.io
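The Task-Adaptive 3D Grounding mechanism can be illustrated with a minimal sketch of a gated fusion step. The sketch below is an assumption, not the paper's actual architecture: it uses a scalar sigmoid gate computed from the 2D context features to scale the 3D feature contribution, so that purely 2D tasks can drive the gate toward 0 while 3D-heavy tasks drive it toward 1. All function and variable names here are illustrative.

```python
import numpy as np

def gated_3d_fusion(feat_2d, feat_3d, w_gate, b_gate):
    """Fuse 2D context features with 3D geometric features.

    A scalar sigmoid gate, computed from the 2D features, scales the
    3D contribution (a hypothetical stand-in for the paper's gated
    router; the real design is not specified in this abstract).
    """
    logit = float(feat_2d @ w_gate + b_gate)   # gating logit from 2D context
    gate = 1.0 / (1.0 + np.exp(-logit))        # sigmoid gate in (0, 1)
    fused = feat_2d + gate * feat_3d           # gated residual fusion
    return fused, gate

# Toy usage with random features of dimension 8.
rng = np.random.default_rng(0)
d = 8
f2d = rng.standard_normal(d)
f3d = rng.standard_normal(d)
w = rng.standard_normal(d)
fused, gate = gated_3d_fusion(f2d, f3d, w, 0.0)
```

A learned gate of this kind leaves the 2D pathway intact when 3D information is unhelpful, which matches the abstract's motivation of preserving 2D generalization while still admitting geometric cues when the task demands them.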