Long-horizon robotic manipulation requires plans that are both logically coherent and geometrically grounded. Existing Vision-Language-Action policies usually hide planning in latent states or expose only one modality: text-only chain-of-thought encodes causal order but misses spatial constraints, while visual prediction provides geometric cues but often remains local and semantically underconstrained. We introduce Interleaved Vision--Language Reasoning (IVLR), a policy framework built around \trace{}, an explicit intermediate representation that alternates textual subgoals with visual keyframes over the full task horizon. At test time, a single native multimodal transformer self-generates this global semantic-geometric trace from the initial observation and instruction, caches it, and conditions a closed-loop action decoder on the trace, original instruction, and current observation. Because standard robot datasets lack such traces, we construct pseudo-supervision by temporally segmenting demonstrations and captioning each stage with a vision-language model. Across simulated benchmarks for long-horizon manipulation and visual distribution shift, \method{} reaches 95.5\% average success on LIBERO, including 92.4\% on LIBERO-Long, and 59.4\% overall success on SimplerEnv-WidowX. Ablations show that both modalities are necessary: without traces, LIBERO-Long success drops to 37.7\%; text-only and vision-only traces reach 62.0\% and 68.4\%, while the full interleaved trace reaches 92.4\%. Stress tests with execution perturbations and masked trace content show moderate degradation, suggesting that the trace can tolerate local corruption and moderate execution drift, but remains limited under stale or incorrect global plans.
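The test-time procedure described above (generate the trace once, cache it, then run a closed-loop decoder conditioned on the trace, the instruction, and the current observation) can be sketched roughly as follows. This is a minimal illustration of the control flow only, not the authors' implementation: the names `TextSubgoal`, `VisualKeyframe`, `run_episode`, `generate_trace`, `decode_action`, and `step_env` are all hypothetical, and the model and environment are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple, Union


@dataclass(frozen=True)
class TextSubgoal:
    """One textual subgoal in the trace (hypothetical type)."""
    text: str


@dataclass(frozen=True)
class VisualKeyframe:
    """One predicted keyframe in the trace; pixels stand in for an image."""
    pixels: Sequence[float]


# An interleaved trace alternates textual subgoals and visual keyframes.
TraceStep = Union[TextSubgoal, VisualKeyframe]


def run_episode(
    instruction: str,
    initial_obs: Any,
    generate_trace: Callable[[Any, str], List[TraceStep]],
    decode_action: Callable[[List[TraceStep], str, Any], Any],
    step_env: Callable[[Any], Tuple[Any, bool]],
    max_steps: int,
) -> Tuple[List[TraceStep], List[Any]]:
    """Plan once, then act in closed loop against the cached plan."""
    # The trace is generated a single time from the initial observation
    # and the instruction, then reused unchanged for the whole episode.
    trace = generate_trace(initial_obs, instruction)

    obs = initial_obs
    actions: List[Any] = []
    for _ in range(max_steps):
        # Each action is conditioned on the cached trace, the original
        # instruction, and the current observation.
        action = decode_action(trace, instruction, obs)
        actions.append(action)
        obs, done = step_env(action)
        if done:
            break
    return trace, actions
```

Because the trace is never regenerated inside the loop, a plan that becomes stale or was wrong from the start is not corrected during execution, which matches the limitation noted for stale or incorrect global plans.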