Vision-Language-Action (VLA) policies remain brittle in long-horizon and high-uncertainty control, where one-pass action decoding leaves little room for inference-time deliberation. Explicit chain-of-thought can increase reasoning depth, but it adds token-generation latency and routes decisions through an indirect text-to-action interface. We propose MPCoT, a reward-guided multi-path latent reasoning framework that initializes $M$ hypotheses, refines them for $K$ weight-tied steps, and softly aggregates them before action decoding. A training-only path-preference objective scores candidate action branches using expert-action consistency, world-model/VLM-based progress estimates, and success feedback, aligning the latent path scorer with downstream execution quality. MPCoT preserves the original 8-step action interface, generates zero reasoning tokens, and exposes configurable inference-time controls $(K, M)$. Under matched protocols on LIBERO and CALVIN, MPCoT improves long-horizon performance, and ablations confirm the effects of reasoning depth and width, confidence-weighted aggregation, and reward-guided path supervision.
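The inference-time procedure described above (initialize $M$ hypotheses, refine them for $K$ weight-tied steps, softly aggregate) can be illustrated with a minimal sketch. This is not the MPCoT implementation: the noise-based initialization, the single `tanh` update, the linear scorer, and every name (`refine_step`, `multi_path_latent`, `W`, `scorer`) are assumptions chosen only to make the control flow concrete.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def refine_step(h, ctx, W):
    """One refinement step. The same W is reused at every step (weight tying).
    The tanh update is a hypothetical stand-in for the real refinement block."""
    mixed = [a + b for a, b in zip(h, ctx)]
    return [math.tanh(u) for u in matvec(W, mixed)]


def multi_path_latent(ctx, M, K, W, scorer, rng):
    """Return the aggregated latent and the per-path soft weights."""
    d = len(ctx)
    # Assumed initialization: M noisy copies of the context embedding.
    paths = [[c + 0.1 * rng.gauss(0.0, 1.0) for c in ctx] for _ in range(M)]
    # Depth: K applications of the same (weight-tied) refinement step.
    for _ in range(K):
        paths = [refine_step(h, ctx, W) for h in paths]
    # Assumed linear path scorer; softmax turns scores into confidence weights.
    scores = [sum(s * x for s, x in zip(scorer, h)) for h in paths]
    weights = softmax(scores)
    # Soft aggregation: confidence-weighted average of the M refined paths.
    agg = [sum(w * h[i] for w, h in zip(weights, paths)) for i in range(d)]
    return agg, weights
```

In this sketch the aggregated latent would be passed to the unchanged action decoder, so no reasoning tokens are generated, and `K` and `M` are ordinary arguments that can be varied at inference time. Training the scorer with the path-preference objective is outside the scope of the example.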