Image-based reinforcement learning (RL) faces significant generalization challenges when the visual environment changes substantially between training and deployment. Under such circumstances, learned policies may perform poorly, leading to degraded results. Previous approaches to this problem have largely focused on broadening the training observation distribution through techniques such as data augmentation and domain randomization. However, given the sequential nature of RL decision-making, residual errors are often propagated by the learned policy and accumulate over the trajectory, severely degrading performance. In this paper, we leverage the observation that rewards predicted under domain shift, even though imperfect, can still be a useful signal to guide fine-tuning. We exploit this property to fine-tune a policy using predicted rewards in the target domain. We find that, even under significant domain shift, the predicted reward still provides a meaningful signal, and fine-tuning substantially improves the original policy. Our approach, termed Predicted Reward Fine-tuning (PRFT), improves performance across diverse tasks in both simulated benchmarks and real-world experiments. More information is available at the project web page: https://sites.google.com/view/prft.
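The core idea can be illustrated with a minimal sketch. This is not the paper's implementation (PRFT operates on image observations with a learned reward model); here the domain shift is abstracted as an observation offset, the reward predictor is a noisy stand-in, and all names (`observe`, `predicted_reward`, `GaussianLinearPolicy`) are assumptions for the demo. The key property it demonstrates is that fine-tuning driven only by an imperfect predicted reward can still improve the true (hidden) return in the shifted domain.

```python
import numpy as np

rng = np.random.default_rng(0)
SHIFT = 0.5  # domain shift, modeled here as a constant observation offset


def observe(state):
    return state + SHIFT  # target-domain observation


def true_reward(state, action):
    # Hidden during fine-tuning; used only for evaluation.
    return -(action + state) ** 2


def predicted_reward(obs, action):
    # Imperfect learned reward model: recovers the true reward up to noise.
    return -(action + obs - SHIFT) ** 2 + 0.1 * rng.normal()


class GaussianLinearPolicy:
    """Gaussian policy with a linear mean: a ~ N(w*obs + b, sigma^2)."""

    def __init__(self, sigma=0.3):
        self.w, self.b, self.sigma = 0.0, 0.0, sigma

    def act(self, obs):
        return self.w * obs + self.b + self.sigma * rng.normal()

    def update(self, obs, acts, rews, lr=0.1):
        # REINFORCE with a mean-reward baseline, driven by predicted rewards.
        adv = rews - rews.mean()
        noise = acts - (self.w * obs + self.b)
        self.w += lr * np.mean(noise * obs * adv)
        self.b += lr * np.mean(noise * adv)


def evaluate(policy, n=1000):
    """Average *true* return in the target domain (evaluation only)."""
    states = rng.uniform(-1, 1, size=n)
    acts = np.array([policy.act(observe(s)) for s in states])
    return float(np.mean(true_reward(states, acts)))


policy = GaussianLinearPolicy()
before = evaluate(policy)
for _ in range(300):                      # fine-tune on predicted rewards only
    states = rng.uniform(-1, 1, size=32)  # one-step (bandit-style) episodes
    obs = observe(states)
    acts = np.array([policy.act(o) for o in obs])
    rews = np.array([predicted_reward(o, a) for o, a in zip(obs, acts)])
    policy.update(obs, acts, rews)
after = evaluate(policy)
print(f"true return before: {before:.3f}  after: {after:.3f}")
```

Even though the fine-tuning loop never sees the true reward, the noisy predicted reward points the policy gradient in a useful direction, and the true return improves, mirroring the abstract's claim at toy scale.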