While standard reinforcement learning optimizes a single reward signal, many applications require optimizing a nonlinear utility $f(J_1^\pi,\dots,J_M^\pi)$ over multiple objectives, where each $J_m^\pi$ denotes the expected discounted return of a distinct reward function. A common approach is concave scalarization, which captures important trade-offs such as fairness and risk sensitivity. However, nonlinear scalarization introduces a fundamental challenge for policy-gradient methods: the gradient depends on $\partial f(J^\pi)$, whereas in practice only empirical return estimates $\hat J$ are available. Because $f$ is nonlinear, the plug-in estimator is biased ($\mathbb{E}[\partial f(\hat J)] \neq \partial f(\mathbb{E}[\hat J])$), producing a persistent gradient bias that degrades sample complexity. In this work we identify and overcome this bias barrier in concave-scalarized multi-objective reinforcement learning. We show that existing policy-gradient methods incur an intrinsic $\widetilde{\mathcal{O}}(\varepsilon^{-4})$ sample complexity due to this bias. To address this issue, we develop a Natural Policy Gradient (NPG) algorithm equipped with a multi-level Monte Carlo (MLMC) estimator that controls the bias of the scalarization gradient while maintaining low sampling cost. We prove that this approach achieves the optimal $\widetilde{\mathcal{O}}(\varepsilon^{-2})$ sample complexity for computing an $\varepsilon$-optimal policy. Furthermore, we show that when the scalarization function is second-order smooth, the first-order bias cancels automatically, allowing vanilla NPG to achieve the same $\widetilde{\mathcal{O}}(\varepsilon^{-2})$ rate without MLMC. Our results provide the first optimal sample-complexity guarantees for concave multi-objective reinforcement learning under policy-gradient methods.
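To make the bias mechanism and the MLMC remedy concrete, below is a minimal NumPy sketch, not the paper's algorithm: it uses a hypothetical concave utility $f(J) = \sum_m \log J_m$, Gaussian surrogate noise on the return estimates in place of actual policy rollouts, and a truncated geometric level distribution; all function names, constants, and distributions are illustrative assumptions.

```python
# Illustrative sketch (not the paper's implementation) of the plug-in bias of a
# nonlinear scalarization gradient and an MLMC-style debiased alternative.
# The utility f(J) = sum_m log(J_m) and the Gaussian return noise are
# hypothetical choices made only for demonstration.
import numpy as np

rng = np.random.default_rng(0)
J_true = np.array([2.0, 3.0])   # true expected returns for M = 2 objectives
NOISE = 0.25                    # std. dev. of a single noisy return estimate


def sample_returns(n):
    """n i.i.d. noisy return vectors with mean J_true (stand-in for rollouts)."""
    return J_true + NOISE * rng.standard_normal((n, len(J_true)))


def grad_f(J):
    """Gradient of the concave scalarization f(J) = sum_m log(J_m)."""
    return 1.0 / J


def plugin_grad(n):
    """Plug-in estimator: grad_f at an n-sample empirical return estimate.

    Biased for finite n because grad_f is nonlinear in the return estimate.
    """
    return grad_f(sample_returns(n).mean(axis=0))


def mlmc_grad(p=0.5, max_level=8):
    """Truncated MLMC-style estimator of grad_f(J_true).

    Draw a random level L, form the telescoping difference between plug-in
    gradients at 2^L and 2^(L-1) samples (the coarse term reuses half of the
    fine batch), and reweight by the level probability. In expectation the
    levels telescope to the plug-in gradient at 2^max_level samples, so the
    residual bias shrinks geometrically while the expected batch size stays
    modest.
    """
    L = min(int(rng.geometric(p)), max_level)
    # Absorb the geometric tail mass into the last level.
    prob_L = (1 - p) ** (max_level - 1) if L == max_level else p * (1 - p) ** (L - 1)
    batch = sample_returns(2 ** L)
    fine = grad_f(batch.mean(axis=0))
    coarse = grad_f(batch[: 2 ** (L - 1)].mean(axis=0))
    return plugin_grad(1) + (fine - coarse) / prob_L


# Compare average estimates against the exact gradient at J_true.
exact = grad_f(J_true)
plug = np.mean([plugin_grad(2) for _ in range(20000)], axis=0)
mlmc = np.mean([mlmc_grad() for _ in range(20000)], axis=0)
print("exact gradient :", exact)
print("plug-in, n=2   :", plug, " bias ~", plug - exact)
print("MLMC-style     :", mlmc, " bias ~", mlmc - exact)
```

In the full method described above, a debiased estimate of the scalarization gradient $\partial f(\hat J)$ would weight the per-objective policy-gradient terms inside the NPG update; the sketch isolates only the bias of that scalarization factor.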