General-purpose robotic systems operating in open-world environments must achieve both broad generalization and high-precision action execution, a combination that remains challenging for existing Vision-Language-Action (VLA) models. While large Vision-Language Models (VLMs) improve semantic generalization, insufficient embodied reasoning yields brittle behavior; conversely, strong reasoning alone is inadequate without precise control. To provide a decoupled, quantitative assessment of this bottleneck, we introduce the Embodied Reasoning Intelligence Quotient (ERIQ), a large-scale benchmark for embodied reasoning in robotic manipulation comprising 6K+ question-answer pairs spanning four reasoning dimensions. By decoupling reasoning from execution, ERIQ enables systematic evaluation and reveals a strong positive correlation between embodied reasoning capability and end-to-end VLA generalization. To bridge the gap from reasoning to precise execution, we propose FACT, a flow-matching-based action tokenizer that converts continuous control into discrete token sequences while preserving high-fidelity trajectory reconstruction. The resulting model, GenieReasoner, jointly optimizes reasoning and action in a unified space and outperforms both continuous-action and prior discrete-action baselines on real-world tasks. Together, ERIQ and FACT provide a principled framework for diagnosing and overcoming the reasoning-precision trade-off, advancing robust, general-purpose robotic manipulation.
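To make the FACT idea concrete, below is a minimal sketch of a flow-matching action tokenizer, assuming the common recipe of (i) vector-quantizing action chunks into discrete tokens and (ii) reconstructing the continuous trajectory from those tokens with a conditional flow-matching decoder. The abstract does not specify FACT's architecture, so every module name (`FlowMatchingActionTokenizer`), layer choice, and hyperparameter here is hypothetical.

```python
# Hypothetical sketch of a flow-matching action tokenizer; not FACT's
# actual implementation, which is not described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowMatchingActionTokenizer(nn.Module):
    def __init__(self, action_dim=7, horizon=16, codebook_size=512, dim=256):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        flat = action_dim * horizon
        # Encoder: continuous action chunk -> latent vector.
        self.encoder = nn.Sequential(
            nn.Linear(flat, dim), nn.GELU(), nn.Linear(dim, dim))
        # Discrete codebook defining the action-token vocabulary.
        self.codebook = nn.Embedding(codebook_size, dim)
        # Velocity field v_theta(x_t, t, token): predicts dx_t/dt.
        self.velocity = nn.Sequential(
            nn.Linear(flat + dim + 1, dim), nn.GELU(), nn.Linear(dim, flat))

    def quantize(self, z):
        # Nearest-neighbor codebook lookup with a straight-through gradient.
        dists = torch.cdist(z, self.codebook.weight)   # (B, K)
        idx = dists.argmin(dim=-1)                     # discrete token ids
        zq = self.codebook(idx)
        return z + (zq - z).detach(), idx

    def fm_loss(self, actions):
        # Conditional flow matching on the linear path x_t = (1-t)x0 + t*x1,
        # with target velocity x1 - x0 (x0 ~ N(0, I), x1 = real chunk).
        B = actions.shape[0]
        x1 = actions.reshape(B, -1)
        zq, _ = self.quantize(self.encoder(x1))
        x0 = torch.randn_like(x1)
        t = torch.rand(B, 1, device=x1.device)
        xt = (1 - t) * x0 + t * x1
        v_pred = self.velocity(torch.cat([xt, zq, t], dim=-1))
        return F.mse_loss(v_pred, x1 - x0)

    @torch.no_grad()
    def detokenize(self, idx, steps=10):
        # Euler integration of the velocity field from noise back to actions.
        zq = self.codebook(idx)
        x = torch.randn(idx.shape[0], self.horizon * self.action_dim,
                        device=idx.device)
        for k in range(steps):
            t = torch.full((idx.shape[0], 1), k / steps, device=idx.device)
            x = x + self.velocity(torch.cat([x, zq, t], dim=-1)) / steps
        return x.reshape(-1, self.horizon, self.action_dim)
```

In such a setup, the policy would emit discrete token ids in the same space as its language tokens, and `detokenize` would integrate the learned velocity field back into a continuous trajectory; a real system would presumably use sequence models and per-timestep tokens rather than the flattened MLPs sketched here.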