Action-Sketcher: From Reasoning to Action via Visual Sketches for Long-Horizon Robotic Manipulation

Huajie Tan, Peterson Co, Yijie Xu, Shanyu Rong, Yuheng Ji, Cheng Chi, Xiansheng Chen, Qiongyu Zhang, Zhongxia Zhao, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang

From arXiv, 26 pages, 14 figures

Long-horizon robotic manipulation is increasingly important for real-world deployment, requiring spatial disambiguation in complex layouts and temporal resilience under dynamic interaction. However, existing end-to-end and hierarchical Vision-Language-Action (VLA) policies often rely on text-only cues while keeping plan intent latent. This undermines referential grounding in cluttered or underspecified scenes, impedes effective decomposition of long-horizon goals under closed-loop interaction, and limits causal explanation by obscuring the rationale behind action choices. To address these issues, we first introduce Visual Sketch, an interpretable visual intermediate that renders points, boxes, arrows, and typed relations in the robot's current views to externalize spatial intent and connect language to scene geometry. Building on Visual Sketch, we present Action-Sketcher, a VLA framework that operates in a cyclic See-Think-Sketch-Act workflow coordinated by an adaptive token-gated strategy for reasoning triggers, sketch revision, and action issuance, thereby supporting reactive corrections and human interaction while preserving real-time action prediction. To enable scalable training and evaluation, we curate a diverse corpus with interleaved images, text, Visual Sketch supervision, and action sequences, and train Action-Sketcher with a multi-stage curriculum recipe that combines interleaved sequence alignment for modality unification, language-to-sketch consistency for precise linguistic grounding, and imitation learning augmented with sketch-to-action reinforcement for robustness. Extensive experiments on cluttered scenes and multi-object tasks, in simulation and on real-world tasks, show improved long-horizon success, stronger robustness to dynamic scene changes, and enhanced interpretability via editable sketches and step-wise plans. Project website: https://action-sketcher.github.io
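
To make the See-Think-Sketch-Act cycle more concrete, here is a minimal Python sketch of how a gate could route each step to reasoning, sketch revision, or action issuance. It is a toy illustration only, not the authors' implementation: the names `VisualSketch`, `Gate`, `choose_gate`, and `run_cycle`, the gate token strings, and the rule-based gating are all assumptions made for this example. In the paper the gating is described as learned and token-based; a hand-written rule is swapped in here for readability.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple

Point = Tuple[int, int]


class Gate(Enum):
    # Hypothetical gate tokens; the actual token vocabulary is not given in the abstract.
    THINK = "<think>"
    SKETCH = "<sketch>"
    ACT = "<act>"


@dataclass
class VisualSketch:
    """Toy container for the primitives the abstract lists (assumed layout)."""
    points: List[Point] = field(default_factory=list)
    boxes: List[Tuple[int, int, int, int]] = field(default_factory=list)
    arrows: List[Tuple[Point, Point]] = field(default_factory=list)
    # Typed relations as (subject, relation type, object) triples.
    relations: List[Tuple[str, str, str]] = field(default_factory=list)


def choose_gate(has_plan: bool, has_sketch: bool, scene_changed: bool) -> Gate:
    """Rule-based stand-in for the adaptive token-gated strategy."""
    if not has_plan:
        return Gate.THINK
    if not has_sketch or scene_changed:
        return Gate.SKETCH
    return Gate.ACT


def run_cycle(scene_changed_flags: List[bool]) -> List[str]:
    """Run one gate decision per observation and return the gate trace."""
    plan: Optional[str] = None
    sketch: Optional[VisualSketch] = None
    trace: List[str] = []
    for changed in scene_changed_flags:  # "See": one observation per step
        gate = choose_gate(plan is not None, sketch is not None, changed)
        if gate is Gate.THINK:
            plan = "move the cup onto the tray"  # placeholder plan text
        elif gate is Gate.SKETCH:
            # Draw (or redraw) the sketch for the current view.
            sketch = VisualSketch(
                points=[(120, 80)],
                arrows=[((120, 80), (200, 80))],
                relations=[("cup", "on", "tray")],
            )
        # Gate.ACT would issue a low-level action conditioned on the sketch.
        trace.append(gate.value)
    return trace
```

Calling `run_cycle([False, False, False, True, False])` yields a think step, a sketch step, an act step, a sketch revision when the scene changes, and then another act step, which mirrors the reactive correction behavior the abstract describes.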
