Unified Embodied VLM Reasoning with Robotic Action via Autoregressive Discretized Pre-training

Yi Liu, Sukai Wang, Dafeng Wei, Xiaowei Cai, Linqing Zhong, Jiange Yang, Guanghui Ren, Jinyu Zhang, Maoqing Yao, Chuankang Li, Xindong He, Liliang Chen, Jianlan Luo

General-purpose robotic systems operating in open-world environments must achieve both broad generalization and high-precision action execution, a combination that remains challenging for existing Vision-Language-Action (VLA) models. While large Vision-Language Models (VLMs) improve semantic generalization, insufficient embodied reasoning leads to brittle behavior, and conversely, strong reasoning alone is inadequate without precise control. To provide a decoupled and quantitative assessment of this bottleneck, we introduce Embodied Reasoning Intelligence Quotient (ERIQ), a large-scale embodied reasoning benchmark in robotic manipulation, comprising 6K+ question-answer pairs across four reasoning dimensions. By decoupling reasoning from execution, ERIQ enables systematic evaluation and reveals a strong positive correlation between embodied reasoning capability and end-to-end VLA generalization. To bridge the gap from reasoning to precise execution, we propose FACT, a flow-matching-based action tokenizer that converts continuous control into discrete sequences while preserving high-fidelity trajectory reconstruction. The resulting GenieReasoner jointly optimizes reasoning and action in a unified space, outperforming both continuous-action and prior discrete-action baselines in real-world tasks. Together, ERIQ and FACT provide a principled framework for diagnosing and overcoming the reasoning-precision trade-off, advancing robust, general-purpose robotic manipulation. Project page: https://geniereasoner.github.io/GenieReasoner/
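
As a point of reference for what converting continuous control into discrete sequences involves, the sketch below shows plain uniform binning, a common baseline for discrete action tokens. It is not FACT: the flow-matching tokenizer is not specified in this abstract, and the function names, the [-1, 1] action range, the 256 bins, and the 7-dimensional action are all assumptions made for illustration.

```python
import random


def tokenize(action, low=-1.0, high=1.0, n_bins=256):
    """Map each continuous value to an integer token in [0, n_bins - 1].

    Uniform binning only; an illustrative baseline, not the FACT tokenizer.
    """
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)  # keep values inside the range
        index = int((clipped - low) / (high - low) * n_bins)
        tokens.append(min(index, n_bins - 1))  # value == high falls in the last bin
    return tokens


def detokenize(tokens, low=-1.0, high=1.0, n_bins=256):
    """Map each token back to the centre of its bin."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]


random.seed(0)
# Hypothetical 7-dimensional action, e.g. one control step of an arm.
action = [random.uniform(-1.0, 1.0) for _ in range(7)]
tokens = tokenize(action)
recon = detokenize(tokens)

# Reconstruction error of uniform binning is bounded by half a bin width.
max_err = max(abs(a - r) for a, r in zip(action, recon))
print(tokens)
print(max_err)
```

With this scheme the reconstruction error per dimension is at most half a bin width, so precision can only be raised by adding bins and enlarging the vocabulary. That coupling is one form of the tension between discrete tokens and high-fidelity trajectory reconstruction that the abstract says FACT is designed to address.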
