Optimizing Conditional Value-at-Risk (CVaR) with policy gradient (a.k.a. CVaR-PG) suffers from significant sample inefficiency. This inefficiency stems from its focus on tail-end performance, which overlooks most of the sampled trajectories. We address this problem by augmenting CVaR with an expected quantile term. Quantile optimization admits a dynamic programming formulation that leverages all sampled data, thus improving sample efficiency. The augmentation does not alter the CVaR objective, since CVaR is the expectation of the quantile over the tail of the return distribution. Empirical results in domains with verifiable risk-averse behavior show that our algorithm, within the Markovian policy class, substantially improves upon CVaR-PG and consistently outperforms other existing methods.
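For reference, the tail-expectation claim above is the standard quantile representation of CVaR, sketched below under the lower-tail, risk-averse convention for returns (the sign convention is an assumption about the paper's setup):
\[
\mathrm{CVaR}_\alpha(Z)
\;=\; \mathbb{E}\!\left[\, Z \mid Z \le F_Z^{-1}(\alpha) \,\right]
\;=\; \frac{1}{\alpha} \int_0^{\alpha} F_Z^{-1}(u)\, \mathrm{d}u,
\qquad \alpha \in (0,1),
\]
where \(F_Z^{-1}(u) = \inf\{z : F_Z(z) \ge u\}\) is the \(u\)-quantile of the return \(Z\); the first equality assumes no atom at the \(\alpha\)-quantile, while the integral form holds in general. In this sense CVaR is the expectation of the quantile function over the tail levels \(u \in (0,\alpha]\).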