Reinforcement learning (RL) agents often degrade in performance when deployed in environments that differ from those seen during training. Existing techniques such as domain randomization (DR) mitigate this, but they require access to diverse training environments and full trajectory observability. These assumptions fail in privacy-preserving or otherwise restricted scenarios where only scalar performance metrics are available. We propose Generalization via Evolutionary Reward Shaping (GERS), a bilevel optimization approach that improves generalization to unseen test environments using only scalar feedback from validation environments. At the lower level, an RL agent learns a policy on a limited set of training environments with accessible trajectory data, guided by a reward function whose shaping parameters are set by the upper level. At the upper level, the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) optimizes those shaping parameters to maximize the cumulative unshaped reward on separate validation environments for which trajectories are unavailable. Results on continuous control tasks indicate that GERS outperforms the standard RL baseline on unseen test environments. GERS also performs comparably to DR, even though DR trains on the union of GERS's training and validation environments with full trajectory access, whereas GERS never observes validation trajectories. These results indicate that GERS improves generalization under restricted data access.
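The bilevel structure described in the abstract can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the "environments" are one-dimensional targets, the policy is a single scalar action, the shaping term is linear (`w * action`), the lower-level learner is plain gradient ascent rather than an RL algorithm, and the upper level is a simple truncation-selection evolution strategy standing in for CMA-ES. The one property it is meant to show is the information flow: the lower level sees per-environment rewards only on training environments, while the upper level sees a single scalar unshaped score from validation environments.

```python
import random

# Hypothetical toy environments: each is defined by a target the action should match.
TRAIN_TARGETS = [0.0, 0.2]   # lower level has full reward access here
VAL_TARGETS = [1.0, 1.2]     # upper level only sees one scalar score from these
TEST_TARGETS = [1.1, 1.3]    # never used during optimization


def true_return(action, target):
    """Unshaped reward: negative squared distance to the environment's target."""
    return -(action - target) ** 2


def train_policy(w, steps=200, lr=0.1):
    """Lower level: gradient ascent on the shaped reward over training environments.

    Shaped reward (an assumed form): true_return(action, target) + w * action.
    """
    action = 0.0
    for _ in range(steps):
        grad_true = sum(-2.0 * (action - t) for t in TRAIN_TARGETS) / len(TRAIN_TARGETS)
        action += lr * (grad_true + w)  # d/da of (w * a) is w
    return action


def scalar_score(action, targets):
    """Mean unshaped return; the only signal exposed to the upper level."""
    return sum(true_return(action, t) for t in targets) / len(targets)


def optimize_shaping(generations=30, popsize=8, elite=4, sigma=1.0, seed=0):
    """Upper level: a simplified evolution strategy (a stand-in for CMA-ES)."""
    rng = random.Random(seed)
    mean = 0.0  # w = 0 corresponds to the unshaped baseline
    best_w = mean
    best_score = scalar_score(train_policy(mean), VAL_TARGETS)
    for _ in range(generations):
        candidates = [mean + sigma * rng.gauss(0.0, 1.0) for _ in range(popsize)]
        scored = sorted(
            ((scalar_score(train_policy(w), VAL_TARGETS), w) for w in candidates),
            reverse=True,
        )
        if scored[0][0] > best_score:
            best_score, best_w = scored[0]
        # Recombine the best candidates and shrink the search radius.
        mean = sum(w for _, w in scored[:elite]) / elite
        sigma *= 0.9
    return best_w


def run_sketch(seed=0):
    """Compare the unshaped baseline with the shaped policy on held-out targets."""
    baseline = train_policy(0.0)
    shaped = train_policy(optimize_shaping(seed=seed))
    return {
        "baseline_val": scalar_score(baseline, VAL_TARGETS),
        "shaped_val": scalar_score(shaped, VAL_TARGETS),
        "baseline_test": scalar_score(baseline, TEST_TARGETS),
        "shaped_test": scalar_score(shaped, TEST_TARGETS),
    }


if __name__ == "__main__":
    for name, value in run_sketch().items():
        print(f"{name}: {value:.4f}")
```

In this toy setting the shaping weight pulls the trained action away from the training targets and toward the validation targets, so the shaped policy scores higher on the nearby test targets than the unshaped baseline does. A faithful implementation would replace `train_policy` with an actual RL training run and the outer loop with a real CMA-ES implementation.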