Embodied reasoning requires models to perceive task-relevant objects and spaces in physical environments and to maintain consistent visual grounding throughout multi-step reasoning. However, current vision-language models rely on text-only or coordinate-augmented chain-of-thought, in which entity references remain implicit and ambiguous. As a result, the reasoning process can decouple from visual evidence, entity references can drift across steps, and the reasoning trajectory can become causally disconnected from the final answer; these problems are further amplified in multi-view scenarios by cross-view appearance changes. To address these issues, we propose Pinned Chain-of-Thought (\pincot{}), a structured reasoning paradigm that pins every reasoning step to visual evidence. \pincot{} introduces the concept of \reasoninganchor{}, which binds each task-relevant entity to a structured visual anchor comprising an entity name, a unique identity, a view index, and spatial grounding, enabling consistent entity tracking across reasoning steps and views. We build a fully automated data generation pipeline to construct \dataset{}, a high-quality \pincot{}-formatted reasoning dataset. We then train \method{} through a three-stage post-training procedure that progressively injects embodied knowledge, structured reasoning ability, and process-supervised alignment, with rewards that directly constrain both anchor localization and identity consistency during reasoning. On 14 benchmarks covering embodied spatial reasoning, multi-view reasoning, and pointing, \method{}, with only 4B parameters, consistently outperforms 7B-level open-source embodied models, achieving a 12\% average improvement over the strongest 7B baseline, Mimo-Embodied. Further analysis shows that \pincot{} improves grounding accuracy and cross-step identity consistency, validating the effectiveness of process supervision.