This study investigates the mean-variance (MV) trade-off in reinforcement learning (RL), an instance of sequential decision-making under uncertainty. Our objective is to obtain MV-efficient policies, whose means and variances lie on the Pareto-efficient frontier with respect to the MV trade-off; on this frontier, any increase in the expected reward necessarily entails an increase in variance, and vice versa. To this end, we propose a method that trains a policy to maximize the expected quadratic utility, defined as a weighted sum of the first and second moments of the rewards obtained by the policy. We then show that the maximizer of this objective is indeed an MV-efficient policy. Previous studies that addressed the MV trade-off via constrained optimization have encountered computational difficulties. Our approach is more computationally efficient because it does not require estimating the gradient of the variance, which is a source of the double-sampling issue in existing methods. Through experiments, we validate the efficacy of our approach.
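To make the objective concrete, the following is a minimal sketch, assuming the weighted sum takes the common quadratic-utility form with a risk-aversion weight \(\lambda > 0\) (a symbol not introduced in this abstract); the paper's exact weighting may differ. Writing \(R\) for the cumulative reward under a policy \(\pi\),
\[
J_\lambda(\pi) \;=\; \mathbb{E}_\pi[R] \;-\; \frac{\lambda}{2}\,\mathbb{E}_\pi[R^2]
\;=\; \mathbb{E}_\pi[R] \;-\; \frac{\lambda}{2}\Bigl(\mathrm{Var}_\pi(R) + \mathbb{E}_\pi[R]^2\Bigr),
\]
and a REINFORCE-style gradient estimate from a single trajectory \(\tau\) with return \(R(\tau)\) is
\[
\widehat{\nabla_\theta J_\lambda}(\pi_\theta)
\;=\; \Bigl(R(\tau) - \tfrac{\lambda}{2}\,R(\tau)^2\Bigr)\,\nabla_\theta \log \pi_\theta(\tau),
\]
which is unbiased using one return sample per trajectory. By contrast, the gradient of \(\mathrm{Var}_\pi(R) = \mathbb{E}_\pi[R^2] - \mathbb{E}_\pi[R]^2\) contains the term \(\mathbb{E}_\pi[R]\,\nabla_\theta \mathbb{E}_\pi[R]\), whose unbiased estimation requires two independent return samples per update, the double-sampling issue mentioned above.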