Most vision-language-action (VLA) models map observations directly to actions without explicit intermediate planning, which limits performance on long-horizon tasks where early mistakes compound. We propose Coarse-to-Control, a plan-execute VLA that introduces planning natively in the action-token space. The key idea is to let the policy first predict a compact sequence of coarse action tokens that summarize the intended future trajectory, and then generate executable action tokens conditioned on this plan. Because both planning and execution share a unified discrete action vocabulary, the plan stays close to the control manifold and provides directly actionable guidance rather than an abstract hint that must be translated back to motor commands. Experiments on LIBERO, SimplerEnv-WidowX, and real-world manipulation tasks show that action-token planning consistently improves over direct action generation, with the largest gains on long-horizon multi-stage tasks.
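To make the plan-then-execute idea concrete, the following is a minimal sketch of two-stage decoding over a single shared action-token vocabulary. It is an illustration under stated assumptions, not the authors' implementation: the `next_token` interface, the function `plan_then_execute`, the parameters `num_plan_tokens` and `num_exec_tokens`, and the stand-in `toy_model` are all hypothetical names introduced here. How coarse tokens are constructed and trained is not specified in the abstract and is not modeled.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: maps a token prefix to the next action-token id.
# A real VLA would also condition on image and language inputs.
NextTokenFn = Callable[[Sequence[int]], int]


def plan_then_execute(
    next_token: NextTokenFn,
    obs_tokens: Sequence[int],
    num_plan_tokens: int,
    num_exec_tokens: int,
) -> Tuple[List[int], List[int]]:
    """Decode coarse plan tokens first, then executable action tokens.

    Both stages call the same model and draw from the same discrete
    vocabulary; the only difference is that the executable tokens are
    generated with the plan already in the prefix.
    """
    prefix = list(obs_tokens)

    # Stage 1: coarse plan summarizing the intended future trajectory.
    plan: List[int] = []
    for _ in range(num_plan_tokens):
        tok = next_token(prefix)
        plan.append(tok)
        prefix.append(tok)

    # Stage 2: executable actions, conditioned on observation + plan.
    actions: List[int] = []
    for _ in range(num_exec_tokens):
        tok = next_token(prefix)
        actions.append(tok)
        prefix.append(tok)

    return plan, actions


def toy_model(prefix: Sequence[int], vocab_size: int = 256) -> int:
    """Deterministic stand-in for a learned policy (assumption, for demo only)."""
    return (sum(prefix) + len(prefix)) % vocab_size


if __name__ == "__main__":
    plan, actions = plan_then_execute(toy_model, [3, 5], 2, 3)
    print("plan tokens:", plan)
    print("action tokens:", actions)
```

Because the plan tokens sit in the same prefix and vocabulary as the executable tokens, no separate translation module is needed between planning and control in this sketch; a direct-generation baseline corresponds to calling the same function with `num_plan_tokens=0`.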