We introduce Value-at-Risk Constrained Policy Optimization (VaR-CPO), a sample-efficient, conservative algorithm designed to optimize Value-at-Risk (VaR) constraints directly. Empirically, we demonstrate that VaR-CPO is capable of safe exploration, achieving zero constraint violations during training in feasible environments, a critical property that baseline methods fail to maintain. To overcome the inherent non-differentiability of the VaR constraint, we employ the one-sided Chebyshev inequality to obtain a tractable surrogate based on the first two moments of the cost return. Additionally, by extending the trust-region framework of Constrained Policy Optimization (CPO), we provide rigorous worst-case bounds on both policy improvement and constraint violation throughout training.
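As a concrete illustration of the moment-based surrogate (a sketch under assumed notation, since the abstract fixes none): let $C$ denote the cost return with mean $\mu_C$ and variance $\sigma_C^2$ under the current policy, and suppose the constraint is $\mathrm{VaR}_\alpha(C) \le d$, i.e., $\Pr(C > d) \le 1-\alpha$. The one-sided Chebyshev (Cantelli) inequality, $\Pr(C - \mu_C \ge t) \le \sigma_C^2/(\sigma_C^2 + t^2)$ for $t > 0$, applied at $t = d - \mu_C$ (assuming $d > \mu_C$) yields
\[
\Pr(C \ge d) \;\le\; \frac{\sigma_C^2}{\sigma_C^2 + (d-\mu_C)^2} \;\le\; 1-\alpha
\quad\Longleftarrow\quad
\mu_C + \sigma_C\,\sqrt{\frac{\alpha}{1-\alpha}} \;\le\; d .
\]
The sufficient condition on the right depends only on the first two moments of the cost return, both estimable from samples and differentiable, which is presumably what renders the VaR constraint tractable inside a trust-region update.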
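For context on the bounds being extended: one standard form of the worst-case constraint-violation guarantee in the trust-region framework of CPO (stated here from Achiam et al., 2017, not from VaR-CPO itself, with $J_C$ the expected cost return, $A_C^\pi$ the cost advantage, $d^\pi$ the discounted state distribution, and $\epsilon_C^{\pi'} = \max_s \big|\mathbb{E}_{a\sim\pi'}[A_C^\pi(s,a)]\big|$) is
\[
J_C(\pi') \;\le\; J_C(\pi) + \frac{1}{1-\gamma}\,
\mathbb{E}_{s\sim d^\pi,\,a\sim\pi'}\!\left[ A_C^\pi(s,a)
+ \frac{2\gamma\,\epsilon_C^{\pi'}}{1-\gamma}\, D_{\mathrm{TV}}\big(\pi'\,\|\,\pi\big)[s] \right],
\]
so constraining the total-variation divergence (or, via Pinsker's inequality, the KL divergence) inside the trust region limits how much any single policy update can worsen the constraint.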