In real-world decision-making settings such as finance, robotics, and autonomous driving, keeping risk under control is often more crucial than maximizing expected reward. The most natural risk measure is variance, but it penalizes upside volatility as much as downside volatility. The (downside) semivariance, which captures only the negative deviation of a random variable below its mean, is better suited to risk-averse purposes. This paper aims to optimize the mean-semivariance (MSV) criterion in reinforcement learning (RL) with respect to the steady-state reward distribution. Because semivariance is time-inconsistent and does not satisfy the standard Bellman equation, traditional dynamic programming methods cannot be applied to MSV problems directly. To tackle this challenge, we resort to perturbation analysis (PA) theory and establish the performance difference formula for MSV. We show that the MSV problem can be solved by iteratively solving a sequence of RL problems with a policy-dependent reward function. Further, we propose two on-policy algorithms based on policy gradient theory and the trust region method. Finally, we conduct experiments ranging from simple bandit problems to continuous control tasks in MuJoCo, which demonstrate the effectiveness of the proposed methods.
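To make the contrast between variance and downside semivariance concrete, the following is a minimal sketch on sampled rewards. It is an illustration under stated assumptions, not the algorithms proposed in the paper: the function names, the trade-off weight `beta`, and the form "mean minus `beta` times semivariance" for the MSV objective are assumptions introduced here, and the statistics are computed from a finite sample rather than from a steady-state distribution.

```python
import numpy as np


def downside_semivariance(rewards):
    """Mean squared deviation of the rewards that fall below the sample mean."""
    r = np.asarray(rewards, dtype=float)
    # Keep only negative deviations; rewards above the mean contribute zero.
    downside = np.minimum(r - r.mean(), 0.0)
    return float(np.mean(downside ** 2))


def mean_semivariance(rewards, beta=1.0):
    """Assumed MSV objective: sample mean minus beta times downside semivariance.

    `beta` is a hypothetical risk-aversion weight chosen for illustration.
    """
    r = np.asarray(rewards, dtype=float)
    return float(r.mean() - beta * downside_semivariance(r))


# Two reward samples with identical variance but opposite skew.
upside_skewed = [1.0, 1.0, 1.0, 5.0]    # occasional large gain
downside_skewed = [5.0, 5.0, 5.0, 1.0]  # occasional large loss

for name, sample in [("upside", upside_skewed), ("downside", downside_skewed)]:
    print(
        name,
        "variance:", float(np.var(sample)),
        "semivariance:", downside_semivariance(sample),
        "msv:", mean_semivariance(sample, beta=1.0),
    )
```

Both samples have the same variance, so a mean-variance criterion penalizes them equally. The semivariance is smaller for the sample whose spread comes from an occasional large gain and larger for the one whose spread comes from an occasional large loss, which is the asymmetry the abstract argues for.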