What happens when a pretrained generative robot policy is given a constant initial noise input, rather than a fresh sample from a Gaussian at every call? We demonstrate that the performance of a pretrained, frozen diffusion or flow matching policy can be improved with respect to a downstream reward by replacing the initial noise sampled from the prior distribution (typically an isotropic Gaussian) with a well-chosen, constant initial noise input, which we call a golden ticket. We propose a search method that finds golden tickets using Monte-Carlo policy evaluation. The method keeps the pretrained policy frozen, does not train any new networks, and is applicable to all diffusion and flow matching policies (and therefore to many vision-language-action models, VLAs). Our approach to policy improvement makes no assumptions beyond being able to inject initial noise into the policy and to compute (sparse) task rewards from episode rollouts, so it can be deployed with no additional infrastructure or models. Our method improves policy performance in 38 out of 43 tasks across simulated and real-world robot manipulation benchmarks, with relative improvements in success rate of up to 58% for some simulated tasks, and of up to 60% within 50 search episodes for real-world tasks. We also show unique benefits of golden tickets in multi-task settings: the diversity of behaviors produced by different tickets naturally defines a Pareto frontier for balancing different objectives (e.g., speed and success rate), and in VLAs we find that a golden ticket optimized for one task can also boost performance on other related tasks. We release a codebase with pretrained policies and golden tickets for simulation benchmarks using VLAs, diffusion policies, and flow matching policies.
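The search idea can be illustrated with a minimal sketch. It is not the released codebase: it assumes the simplest possible strategy (draw candidate noises from the Gaussian prior, score each by the mean sparse reward over a few rollouts, keep the best), and `toy_episode` is a hypothetical stand-in for rolling out a frozen policy with a fixed initial noise. The names `search_golden_ticket`, `run_episode`, and all parameters are illustrative assumptions.

```python
import numpy as np


def toy_episode(noise, rng):
    """Hypothetical stand-in for one rollout of a frozen policy.

    A real implementation would inject `noise` as the policy's initial noise
    at every action-sampling step and return the sparse task reward. Here the
    success probability is simply made to depend on the noise vector.
    """
    target = np.ones_like(noise)
    p_success = float(np.exp(-0.5 * np.mean((noise - target) ** 2)))
    return float(rng.random() < p_success)  # sparse reward: 0.0 or 1.0


def search_golden_ticket(run_episode, noise_shape, num_candidates,
                         episodes_per_candidate, seed=0):
    """Best-of-N search over constant initial noises (assumed strategy).

    Each candidate is scored by Monte-Carlo policy evaluation: the mean
    reward over `episodes_per_candidate` rollouts with that fixed noise.
    """
    rng = np.random.default_rng(seed)
    best_noise, best_score = None, -np.inf
    for _ in range(num_candidates):
        # Candidates come from the same prior the policy normally samples.
        candidate = rng.standard_normal(noise_shape)
        returns = [run_episode(candidate, rng)
                   for _ in range(episodes_per_candidate)]
        score = float(np.mean(returns))
        if score > best_score:
            best_noise, best_score = candidate, score
    return best_noise, best_score


if __name__ == "__main__":
    ticket, score = search_golden_ticket(
        toy_episode, noise_shape=(4,), num_candidates=10,
        episodes_per_candidate=5, seed=0)
    print("estimated success rate of best ticket:", score)
```

Because the best candidate is selected on noisy estimates, its search-time score is optimistically biased; re-evaluating the chosen ticket on fresh episodes gives a fairer estimate.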