Restricting the variance of a policy's return is a popular choice in risk-averse Reinforcement Learning (RL) due to its clear mathematical definition and easy interpretability. Traditional methods directly restrict the total return variance, while recent methods restrict the per-step reward variance as a proxy. We thoroughly examine the limitations of these variance-based methods, such as sensitivity to numerical scale and hindrance of policy learning, and propose Gini deviation as an alternative risk measure. We study various properties of this risk measure and derive a policy gradient algorithm to minimize it. Empirical evaluation in domains where risk-aversion can be clearly defined shows that our algorithm mitigates the limitations of variance-based risk measures and achieves high return with low risk, in terms of both variance and Gini deviation, when other methods fail to learn a reasonable policy.
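As a rough illustration of the scale sensitivity mentioned above, the sketch below compares how variance and Gini deviation respond when returns are multiplied by a constant. It assumes the standard definition of Gini deviation as half the expected absolute difference between two independent copies of the return, GD[X] = (1/2) E|X1 - X2|, and estimates it from samples. The function name `gini_deviation` and the sample returns are hypothetical and chosen only for illustration; this is not the paper's policy gradient algorithm.

```python
from statistics import pvariance


def gini_deviation(returns):
    """Empirical Gini deviation: half the mean absolute difference
    over all pairs of distinct samples (assumed definition, see lead-in)."""
    xs = sorted(returns)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two samples")
    # For sorted data, sum_{i<j} (x_j - x_i) = sum_i (2i - n - 1) * x_i, i = 1..n.
    pair_sum = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    # Mean over n(n-1)/2 pairs, then halved, gives pair_sum / (n(n-1)).
    return pair_sum / (n * (n - 1))


# Hypothetical episode returns, and the same returns with rewards scaled by 10.
returns = [0.0, 1.0, 2.0, 3.0]
scaled = [10.0 * r for r in returns]

gd_ratio = gini_deviation(scaled) / gini_deviation(returns)
var_ratio = pvariance(scaled) / pvariance(returns)
print(f"Gini deviation grows by a factor of about {gd_ratio:.1f}")
print(f"Variance grows by a factor of about {var_ratio:.1f}")
```

Under this definition, scaling the returns by a positive constant c scales Gini deviation by c but variance by c squared, which is one way to read the sensitivity to numerical scale noted above.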