Fine-tuning pre-trained robot policies with reinforcement learning (RL) often inherits a bottleneck from behavioral cloning (BC) pre-training: BC produces narrow action distributions that lack the coverage needed for downstream exploration. We present a unified framework that bridges BC pre-training and RL fine-tuning to provide the exploration that efficient robot policy fine-tuning requires. Our pre-training method, Context-Smoothed Pre-training (CSP), injects forward-diffusion noise into policy inputs, creating a continuum between precise imitation and broad action coverage. We then fine-tune the pre-trained policies with Timestep-Modulated Reinforcement Learning (TMRL), which trains the agent to adjust this conditioning dynamically by modulating the diffusion timestep, granting explicit control over exploration. TMRL integrates seamlessly with arbitrary policy inputs, e.g., states, 3D point clouds, or image-based VLA policies, and we show that it improves the sample efficiency of RL fine-tuning. Notably, TMRL enables successful real-world fine-tuning on complex manipulation tasks in under one hour. Videos and code are available at https://weirdlabuw.github.io/tmrl/.
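To make the idea of injecting forward-diffusion noise into policy inputs concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard DDPM-style forward process, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with a linear beta schedule. The function names `make_alpha_bar` and `smooth_context`, the schedule parameters, and the example state vector are all hypothetical and chosen only for illustration.

```python
import math
import random


def make_alpha_bar(num_steps=100, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention coefficients for a linear beta schedule.

    Assumption: a DDPM-style linear schedule. Index 0 is set to 1.0 so that
    timestep 0 leaves the input unchanged. Requires num_steps >= 2.
    """
    alpha_bar = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        alpha_bar.append(alpha_bar[-1] * (1.0 - beta))
    return alpha_bar


def smooth_context(obs, t, alpha_bar, rng):
    """Apply forward-diffusion noise at timestep t to a policy input (hypothetical helper)."""
    signal = math.sqrt(alpha_bar[t])        # weight on the clean input
    noise = math.sqrt(1.0 - alpha_bar[t])   # weight on the Gaussian noise
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in obs]


rng = random.Random(0)
alpha_bar = make_alpha_bar()
obs = [0.5, -1.0, 2.0]  # hypothetical low-dimensional state vector

clean = smooth_context(obs, 0, alpha_bar, rng)    # t = 0: input passes through unchanged
noisy = smooth_context(obs, 100, alpha_bar, rng)  # larger t: more of the input is replaced by noise
```

In this sketch the timestep `t` is passed by hand. In the framework described above, TMRL would instead have the agent choose the timestep during fine-tuning, so small timesteps keep the policy close to precise imitation and larger ones widen the actions it can reach.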