Multi-Fidelity Policy Gradient Algorithms

Many reinforcement learning (RL) algorithms require large amounts of data, which precludes their use in applications where frequent interaction with the operational system is infeasible or where high-fidelity simulations are expensive or unavailable. Meanwhile, low-fidelity simulators (such as reduced-order models, heuristic reward functions, or generative world models) can cheaply provide useful data for RL training, even if they are too coarse for direct sim-to-real transfer. We propose multi-fidelity policy gradients (MFPGs), an RL framework that mixes a small amount of data from the target environment with a large volume of low-fidelity simulation data to form unbiased, reduced-variance estimators of on-policy policy gradients based on control variates. We instantiate the framework by developing multi-fidelity variants of two policy gradient algorithms: REINFORCE and proximal policy optimization. Experimental results across a suite of simulated robotics benchmark problems demonstrate that, when target-environment samples are limited, MFPG achieves up to 3.9x higher reward and improves training stability compared to baselines that use only high-fidelity data. Moreover, even when the baselines are given more high-fidelity samples (up to 10x as many interactions with the target environment), MFPG continues to match or outperform them. Finally, we observe that MFPG can train effective policies even when the low-fidelity environment is drastically different from the target environment. MFPG thus not only offers a novel paradigm for efficient sim-to-real transfer but also provides a principled approach to managing the trade-off between policy performance and data collection costs.
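The control-variate idea mentioned in the abstract can be illustrated with a generic Monte Carlo sketch. This is not the paper's algorithm: it estimates a scalar mean rather than a policy gradient, and the function name `mf_control_variate_mean`, the toy data model, the fixed coefficient `c`, and the use of shared random draws to correlate the two fidelities are all assumptions made for illustration.

```python
import numpy as np


def mf_control_variate_mean(high_paired, low_paired, low_extra, c=1.0):
    """Estimate E[high] from a few paired samples plus many cheap low-fidelity ones.

    Hypothetical helper for illustration. The correction term has zero mean
    whenever low_paired and low_extra come from the same distribution, so the
    estimate stays unbiased for any fixed coefficient c.
    """
    high_paired = np.asarray(high_paired, dtype=float)
    low_paired = np.asarray(low_paired, dtype=float)
    low_extra = np.asarray(low_extra, dtype=float)
    correction = low_extra.mean(axis=0) - low_paired.mean(axis=0)
    return high_paired.mean(axis=0) + c * correction


def simulate(n_trials=2000, n_high=10, n_low=1000, seed=0):
    """Compare the plain high-fidelity mean with the multi-fidelity estimate."""
    rng = np.random.default_rng(seed)
    plain, mf = [], []
    for _ in range(n_trials):
        # Shared randomness z correlates the two fidelities (toy assumption).
        z = rng.normal(size=n_high)
        high = 1.0 + z + 0.3 * rng.normal(size=n_high)  # true mean is 1.0
        low = 0.5 + z                                   # biased but correlated
        low_extra = 0.5 + rng.normal(size=n_low)        # many cheap samples
        plain.append(high.mean())
        mf.append(mf_control_variate_mean(high, low, low_extra, c=1.0))
    return np.array(plain), np.array(mf)


plain, mf = simulate()
print("mean (plain, multi-fidelity):", plain.mean(), mf.mean())
print("variance (plain, multi-fidelity):", plain.var(), mf.var())
```

In this toy setup both estimators center on the true mean of 1.0 even though the low-fidelity samples are biased, while the multi-fidelity estimate has a much smaller variance because the paired low-fidelity samples cancel most of the shared noise. A fixed `c` is used here for simplicity; estimating it from the same samples would typically introduce a small bias.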
