Restricting the variance of a policy's return is a popular choice in risk-averse Reinforcement Learning (RL) due to its clear mathematical definition and easy interpretability. Traditional methods directly restrict the total return variance, while recent methods restrict the per-step reward variance as a proxy. We thoroughly examine the limitations of these variance-based methods, such as their sensitivity to numerical scale and their tendency to hinder policy learning, and propose an alternative risk measure, Gini deviation, as a substitute. We study various properties of this new risk measure and derive a policy gradient algorithm to minimize it. Empirical evaluation in domains where risk-aversion can be clearly defined shows that our algorithm mitigates the limitations of variance-based risk measures and achieves high return with low risk, in terms of both variance and Gini deviation, when other methods fail to learn a reasonable policy.
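To make the scale-sensitivity point concrete, the following is a minimal sketch and not the paper's algorithm. It assumes the standard definition of Gini deviation as half the expected absolute difference between two independent copies of the return, GD(X) = ½·E|X₁ − X₂|, and estimates it on a small hand-picked sample of returns. The function names `variance` and `gini_deviation` and the sample values are hypothetical, chosen only for illustration.

```python
def variance(xs):
    """Population variance of a sample of returns."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def gini_deviation(xs):
    """Empirical Gini deviation: 0.5 * E|X1 - X2|, where X1 and X2 are
    drawn independently from the empirical distribution of xs.
    (Assumed standard definition; O(n^2) for clarity, not speed.)"""
    n = len(xs)
    total = sum(abs(a - b) for a in xs for b in xs)
    return 0.5 * total / (n * n)


# Hypothetical returns, and the same returns with rewards multiplied by 10.
returns = [0.0, 1.0, 2.0, 3.0]
scaled = [10.0 * r for r in returns]

var_ratio = variance(scaled) / variance(returns)
gd_ratio = gini_deviation(scaled) / gini_deviation(returns)

print("variance grows by a factor of", var_ratio)       # quadratic in the scale
print("Gini deviation grows by a factor of", gd_ratio)  # linear in the scale
```

Under these assumptions, multiplying rewards by a constant c multiplies the variance by c² but the Gini deviation only by |c|, which is one way to read the sensitivity to numerical scale mentioned above.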