Some of today's best-performing reinforcement learning algorithms are prohibitively expensive because they rely on test-time scaling, such as sampling multiple action candidates and selecting the best one. In this work, we propose FASTER, a method that obtains the benefits of sampling-based test-time scaling for diffusion-based policies without the computational cost, by tracing the performance gain of action samples back to earlier stages of the denoising process. Our key insight is that denoising multiple action candidates and selecting the best one can be modeled as a Markov Decision Process (MDP) whose goal is to progressively filter action candidates before denoising is complete. With this MDP, we learn a policy and a value function in the denoising space that predict the downstream value of action candidates during denoising and filter them while maximizing returns. The resulting method is lightweight and can be plugged into existing generative RL algorithms. Across challenging long-horizon manipulation tasks in online and batch-online RL, FASTER consistently improves the underlying policies and achieves the best overall performance among the compared methods. Applied to a pretrained VLA, FASTER achieves the same performance while substantially reducing training and inference compute requirements. Code is available at https://github.com/alexanderswerdlow/faster.
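The idea of progressively filtering action candidates during denoising can be illustrated with a minimal sketch. It is not the FASTER implementation: `toy_denoise_step` and `toy_value` are hypothetical stand-ins for a diffusion policy's denoiser and a learned denoising-space value function, and the fixed `keep_fraction` schedule replaces the learned filtering policy described above. The sketch only shows why discarding candidates early reduces the number of denoiser evaluations.

```python
import numpy as np


def toy_denoise_step(x, step, num_steps, modes):
    """Hypothetical denoiser: pull each candidate toward its nearest mode."""
    dists = np.linalg.norm(x[:, None, :] - modes[None, :, :], axis=-1)
    nearest = modes[np.argmin(dists, axis=1)]
    alpha = 1.0 / (num_steps - step)  # reaches the mode exactly on the last step
    return x + alpha * (nearest - x)


def toy_value(x, good_mode):
    """Hypothetical value estimate: candidates closer to `good_mode` score higher."""
    return -np.linalg.norm(x - good_mode, axis=-1)


def progressive_filter(noise, modes, good_mode, num_steps, keep_fraction):
    """Denoise candidates step by step, dropping low-value ones along the way.

    Returns the selected action and the number of per-candidate denoiser calls.
    The fixed top-k schedule is an assumption made for illustration only.
    """
    candidates = np.array(noise, dtype=float)
    denoise_calls = 0
    for step in range(num_steps):
        candidates = toy_denoise_step(candidates, step, num_steps, modes)
        denoise_calls += len(candidates)
        if len(candidates) > 1:
            keep = max(1, int(np.ceil(len(candidates) * keep_fraction)))
            scores = toy_value(candidates, good_mode)
            candidates = candidates[np.argsort(scores)[-keep:]]  # keep the top scorers
    best = candidates[np.argmax(toy_value(candidates, good_mode))]
    return best, denoise_calls


modes = np.array([[2.0, 2.0], [-2.0, -2.0]])
rng = np.random.default_rng(0)
noise = rng.standard_normal((8, 2))
action, calls = progressive_filter(noise, modes, modes[0], num_steps=4, keep_fraction=0.5)
print("selected action:", action)
print("denoiser calls:", calls, "vs. best-of-N:", noise.shape[0] * 4)
```

With 8 candidates, 4 denoising steps, and a keep fraction of 0.5, the loop performs 8 + 4 + 2 + 1 = 15 denoiser evaluations, compared with 32 when all candidates are fully denoised before selection.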