Generative models have recently demonstrated remarkable success across diverse domains, motivating their adoption as expressive policies in reinforcement learning (RL). While they have shown strong performance in offline RL, particularly where the target distribution is well defined, their extension to online fine-tuning has largely been treated as a direct continuation of offline pre-training, leaving key challenges unaddressed. In this paper, we propose Flow Matching with Injected Noise for Offline-to-Online RL (FINO), a novel method that leverages flow matching-based policies to improve sample efficiency in offline-to-online RL. FINO facilitates effective exploration by injecting noise into policy training, thereby encouraging a broader range of actions beyond those observed in the offline dataset. In addition to exploration-enhanced flow policy training, we incorporate an entropy-guided sampling mechanism to balance exploration and exploitation, allowing the policy to adapt its behavior throughout online fine-tuning. Experiments across diverse, challenging tasks demonstrate that FINO consistently achieves superior performance under limited online budgets.
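To make the noise-injection idea concrete, the sketch below shows a standard conditional flow matching loss for a state-conditioned action policy, with Gaussian noise added to the target actions as one plausible reading of "injecting noise into policy training." This is a minimal illustration under assumptions: the network `VelocityField`, the loss `fino_style_loss`, the perturbation form, and the `noise_scale` parameter are hypothetical and not taken from the paper, which does not specify its architecture or noise schedule here.

```python
# Minimal sketch of noise-injected conditional flow matching for a policy.
# All names and the exact noise form are illustrative assumptions, not FINO's
# actual implementation.
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Velocity network v_theta(x_t, t, s) conditioned on state s (hypothetical)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, x_t, t, s):
        return self.net(torch.cat([x_t, t, s], dim=-1))

def fino_style_loss(v_theta, states, actions, noise_scale=0.1):
    """Flow matching loss with noise injected into the dataset actions
    (assumed form of the exploration-enhancing perturbation)."""
    x1 = actions + noise_scale * torch.randn_like(actions)  # injected noise (assumption)
    x0 = torch.randn_like(actions)                          # base Gaussian sample
    t = torch.rand(actions.shape[0], 1)                     # flow time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                             # linear interpolation path
    target_v = x1 - x0                                      # straight-line target velocity
    pred_v = v_theta(x_t, t, states)
    return ((pred_v - target_v) ** 2).mean()

# Usage on random tensors, for shape checking only.
v = VelocityField(state_dim=17, action_dim=6)
opt = torch.optim.Adam(v.parameters(), lr=3e-4)
loss = fino_style_loss(v, torch.randn(32, 17), torch.randn(32, 6))
loss.backward(); opt.step()
```

Perturbing the regression targets widens the distribution the flow policy learns to generate, which is consistent with the abstract's goal of covering actions beyond the offline dataset; the entropy-guided sampling mechanism mentioned above would then modulate how aggressively such perturbed actions are sampled during online fine-tuning.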