A Multi-Fidelity Control Variate Approach for Policy Gradient Estimation

Many reinforcement learning (RL) algorithms are impractical for training in operational systems or computationally expensive high-fidelity simulations, as they require large amounts of data. Meanwhile, low-fidelity simulators, e.g., reduced-order models, heuristic rewards, or learned world models, can cheaply provide useful data, even if they are too coarse for zero-shot transfer. We propose multi-fidelity policy gradients (MFPGs), a sample-efficient RL framework that mixes scarce target-environment data with a control variate formed from abundant low-fidelity simulation data to construct an unbiased, variance-reduced estimator for on-policy policy gradients. We instantiate the framework with a practical, multi-fidelity variant of the classical REINFORCE algorithm. Under standard assumptions, the MFPG estimator guarantees asymptotic convergence to locally optimal policies in the target environment and achieves faster finite-sample convergence than standard REINFORCE. We evaluate MFPG on robotics benchmark tasks with limited high-fidelity data but abundant off-dynamics, low-fidelity data. When low-fidelity data are neutral or beneficial and dynamics gaps are mild to moderate, MFPG is the only method, among the evaluated off-dynamics RL and low-fidelity-only approaches, that consistently achieves statistically significant improvements over a high-fidelity-only baseline. When low-fidelity data become harmful, MFPG exhibits the strongest robustness, whereas strong off-dynamics RL methods exploit low-fidelity data aggressively and fail much more severely. An additional experiment with anti-correlated high- and low-fidelity rewards shows that MFPG can remain effective even under reward misspecification. MFPG thus offers a reliable paradigm for exploiting cheap low-fidelity data (e.g., for efficient sim-to-real transfer) while managing the trade-off between policy performance and data collection cost.
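
The control variate idea behind the estimator can be illustrated on scalar toy data. The sketch below is a generic multi-fidelity control variate for estimating a mean, not the MFPG gradient estimator itself. The function names, the Gaussian toy model, and the fixed coefficient `c` are all assumptions made for illustration. A few high-fidelity samples are paired with correlated low-fidelity samples, and a large independent low-fidelity batch supplies the correction term, which has zero mean for any fixed `c`.

```python
import random
import statistics


def control_variate_mean(x_hi, y_lo_paired, y_lo_extra, c):
    """Estimate E[X] from a small high-fidelity sample x_hi.

    y_lo_paired[i] is a low-fidelity sample correlated with x_hi[i];
    y_lo_extra is a large, independently drawn low-fidelity batch that
    approximates E[Y]. For a fixed c the correction term has zero mean,
    so the estimate remains unbiased for E[X].
    """
    correction = statistics.fmean(y_lo_extra) - statistics.fmean(y_lo_paired)
    return statistics.fmean(x_hi) + c * correction


def simulate(n_trials=500, n_hi=10, n_lo=500, seed=0):
    """Compare the plain sample mean with the control variate estimate.

    Hypothetical toy model: high fidelity has mean 1.0, low fidelity has a
    biased mean of 0.5, and both share the noise term z.
    """
    rng = random.Random(seed)
    plain, multi_fidelity = [], []
    for _ in range(n_trials):
        z = [rng.gauss(0, 1) for _ in range(n_hi)]  # shared noise
        x_hi = [1.0 + zi + 0.3 * rng.gauss(0, 1) for zi in z]
        y_lo = [0.5 + zi for zi in z]
        y_extra = [0.5 + rng.gauss(0, 1) for _ in range(n_lo)]
        plain.append(statistics.fmean(x_hi))
        # c = 1.0 is Cov(X, Y) / Var(Y) for this toy model.
        multi_fidelity.append(
            control_variate_mean(x_hi, y_lo, y_extra, c=1.0)
        )
    return plain, multi_fidelity


if __name__ == "__main__":
    plain, multi_fidelity = simulate()
    print("plain mean/variance:", statistics.fmean(plain),
          statistics.variance(plain))
    print("control variate mean/variance:", statistics.fmean(multi_fidelity),
          statistics.variance(multi_fidelity))
```

In this toy setting both estimators target the high-fidelity mean even though the low-fidelity mean is biased. The variance reduction depends on how strongly the paired samples are correlated and on the size of the extra low-fidelity batch.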
