Reinforcement learning (RL) allows vision-language-action (VLA) policies to generalize beyond their training distribution by optimizing directly for task success, but post-training is computationally expensive. A natural response has been to speed up rollout collection through faster simulators and world models. In GRPO-based VLA RL, we find that the dominant cost lies elsewhere: gradient computation accounts for approximately 78% of wall-clock time per step in our runs, while rollout collection accounts for only 21%. Gradient cost dominates because much of this computation is spent on phases that contribute little to learning. GRPO's learning signal is driven by advantage variance: only phases where successful and failed rollouts diverge produce a learning signal. However, GRPO assigns the same advantage to every chunk in a rollout. As a result, actor-update compute is spent uniformly across the trajectory, including phases the policy already handles after pre-training and supervised fine-tuning. This paper presents Probabilistic Chunk Masking (PCM), a drop-in modification to GRPO that allocates gradient computation to a small, probabilistically selected subset of chunks per trajectory. PCM scores semantic phases using success-failure action variance, a rollout-derived proxy for per-phase gradient variance, and samples a fixed chunk budget with online-updated phase-level keep probabilities. We formalize per-phase gradient variance as the quantity that determines where gradient computation is useful and show that success-failure action variance provides a measurable proxy for it. PCM requires no reward model or learned critic. On three LIBERO benchmarks, PCM matches the final success rate of standard GRPO while backpropagating through fewer than 20% of trajectory chunks, achieving a 2.38× wall-clock speedup, 4.8× faster gradient updates, and 60% lower peak activation memory.
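The selection step described in the abstract (score phases, update keep probabilities online, sample a fixed chunk budget) can be illustrated with a minimal NumPy sketch. The abstract does not give the exact estimator or update rule, so everything below is an assumption made for illustration: success-failure action variance is taken to be the squared gap between the mean action of successful and failed rollouts, the keep probabilities are updated with an exponential moving average, and the names `phase_scores`, `update_keep_probs`, `sample_chunk_mask`, `momentum`, and `floor` are hypothetical.

```python
import numpy as np


def phase_scores(actions, success, phase_ids, num_phases):
    """Score each phase by how far successful and failed rollouts diverge.

    actions:   (N, T, D) array, one D-dim action chunk per rollout and chunk index
    success:   (N,) boolean array, True for rollouts that solved the task
    phase_ids: (T,) integer array assigning each chunk index to a phase
    (Assumed estimator: squared success/failure mean gap, not the paper's definition.)
    """
    scores = np.zeros(num_phases)
    if success.all() or not success.any():
        return scores  # no success/failure contrast in this group
    gap = actions[success].mean(axis=0) - actions[~success].mean(axis=0)  # (T, D)
    per_chunk = (gap ** 2).mean(axis=1)  # (T,)
    for k in range(num_phases):
        in_phase = phase_ids == k
        if in_phase.any():
            scores[k] = per_chunk[in_phase].mean()
    return scores


def update_keep_probs(keep_probs, scores, momentum=0.9, floor=0.05):
    """Assumed online update: move keep probabilities toward normalized scores."""
    total = scores.sum()
    if total > 0:
        target = scores / total
    else:
        target = np.full_like(scores, 1.0 / len(scores))
    new_probs = momentum * keep_probs + (1.0 - momentum) * target
    # The floor keeps every phase sampleable, so no phase is starved forever.
    return np.clip(new_probs, floor, 1.0)


def sample_chunk_mask(keep_probs, phase_ids, budget, rng):
    """Sample exactly `budget` chunk indices, weighted by their phase's keep probability."""
    weights = keep_probs[phase_ids]
    chosen = rng.choice(len(phase_ids), size=budget, replace=False,
                        p=weights / weights.sum())
    mask = np.zeros(len(phase_ids), dtype=bool)
    mask[chosen] = True
    return mask


# Toy group: 8 rollouts, 20 chunks in 4 phases, failures diverge only in phase 2.
rng = np.random.default_rng(0)
phase_ids = np.repeat(np.arange(4), 5)
success = np.array([True] * 4 + [False] * 4)
actions = 0.01 * rng.normal(size=(8, 20, 7))
actions[~success, 10:15] += 1.0

scores = phase_scores(actions, success, phase_ids, num_phases=4)
keep_probs = update_keep_probs(np.full(4, 0.25), scores)
mask = sample_chunk_mask(keep_probs, phase_ids, budget=3, rng=rng)
```

In a training loop, only the chunks where `mask` is true would enter the actor-update forward and backward pass, while the group-level GRPO advantage itself is left unchanged. A budget of 3 out of 20 chunks (15%) is chosen here only to mirror the "fewer than 20%" figure.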