Restricting the variance of a policy's return is a popular choice in risk-averse Reinforcement Learning (RL) due to its clear mathematical definition and easy interpretability. Traditional methods directly restrict the total return variance, whereas recent methods restrict the per-step reward variance as a proxy. We thoroughly examine the limitations of these variance-based methods, such as sensitivity to numerical scale and hindrance of policy learning, and propose Gini deviation as an alternative risk measure. We study various properties of this risk measure and derive a policy gradient algorithm to minimize it. Empirical evaluation in domains where risk aversion can be clearly defined shows that our algorithm mitigates the limitations of variance-based risk measures and achieves high return with low risk, in terms of both variance and Gini deviation, when other methods fail to learn a reasonable policy.
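To make the point about numerical scale concrete, the following is a minimal sketch, not the paper's algorithm. It assumes the standard definition of Gini deviation as half the expected absolute difference between two independent copies of the return, estimated here over all pairs of sampled returns. The function name `gini_deviation` and the sample values are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from statistics import pvariance


def gini_deviation(samples):
    """Empirical Gini deviation: half the mean absolute difference
    between two independent draws from the empirical distribution."""
    n = len(samples)
    if n < 2:
        return 0.0
    # Sum |x_i - x_j| over unordered pairs. Each pair occurs twice among the
    # n * n ordered pairs, and the factor 1/2 in the definition cancels that.
    total = sum(abs(a - b) for a, b in combinations(samples, 2))
    return total / (n * n)


# Hypothetical sampled returns, then the same returns with rewards scaled by 10.
returns = [1.0, 2.0, 3.0, 4.0]
scaled = [10.0 * r for r in returns]

gd_ratio = gini_deviation(scaled) / gini_deviation(returns)
var_ratio = pvariance(scaled) / pvariance(returns)

print("Gini deviation ratio:", gd_ratio)  # grows linearly with the scale
print("Variance ratio:", var_ratio)       # grows quadratically with the scale
```

Under this definition, multiplying all rewards by a constant c multiplies the Gini deviation by |c| but the variance by c squared, which is one way to read the abstract's remark that variance-based methods are sensitive to numerical scale.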