The primary goal of reinforcement learning is to develop decision-making policies that maximize performance, typically without accounting for risk or safety. In contrast, safe reinforcement learning aims to mitigate or avoid unsafe states. This paper presents a risk-sensitive Q-learning algorithm that leverages optimal transport theory to enhance agent safety. By integrating optimal transport into the Q-learning framework, our approach optimizes the policy's expected return while minimizing the Wasserstein distance between the policy's stationary distribution and a predefined risk distribution, which encapsulates safety preferences from domain experts. We validate the proposed algorithm in a Gridworld environment. The results indicate that our method significantly reduces the frequency of visits to risky states and converges to a stable policy faster than traditional Q-learning.
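The core idea can be sketched in a few lines: augment the tabular Q-learning reward with a penalty proportional to the Wasserstein distance between the agent's empirical state-visit distribution and an expert-specified risk distribution. The sketch below is a minimal illustration under assumptions not fixed by the abstract: a tiny 2x3 Gridworld flattened to states 0..5, a discrete 1D Wasserstein-1 distance (sum of absolute CDF differences under a unit ground metric), and a per-step penalty weight `lam`; the paper's actual environment, ground metric, and penalty schedule may differ.

```python
import random

def w1_discrete(p, q):
    """Wasserstein-1 distance between two distributions over states
    ordered 0..n-1 with unit ground metric: sum of |CDF differences|."""
    d = cp = cq = 0.0
    for pi, qi in zip(p, q):
        cp, cq = cp + pi, cq + qi
        d += abs(cp - cq)
    return d

def train(episodes=300, alpha=0.5, gamma=0.95, lam=0.5, eps=0.1, seed=0):
    rng = random.Random(seed)
    # Hypothetical 2x3 Gridworld flattened to states 0..5:
    #   0 1 2    start = 0, goal = 2, risky state = 1;
    #   3 4 5    the safe detour runs through the bottom row 3-4-5.
    moves = {0: [1, 3], 1: [0, 2, 4], 2: [], 3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}
    # Expert "risk distribution": zero mass on the risky state 1.
    target = [0.2, 0.0, 0.2, 0.2, 0.2, 0.2]
    # Actions are identified with the adjacent state moved into.
    Q = {s: {s2: 0.0 for s2 in moves[s]} for s in moves}
    visits = [1e-9] * 6  # small prior mass avoids division by zero
    for _ in range(episodes):
        s, steps = 0, 0
        while s != 2 and steps < 50:
            opts = moves[s]
            a = rng.choice(opts) if rng.random() < eps \
                else max(opts, key=lambda x: Q[s][x])
            visits[a] += 1
            total = sum(visits)
            emp = [v / total for v in visits]
            # Task reward minus a Wasserstein penalty that pulls the
            # empirical visit distribution toward the expert target.
            r = (1.0 if a == 2 else -0.01) - lam * w1_discrete(emp, target)
            nxt = max(Q[a].values()) if Q[a] else 0.0
            Q[s][a] += alpha * (r + gamma * nxt - Q[s][a])
            s, steps = a, steps + 1
    return Q, visits
```

Because visiting the risky state increases the empirical mass where the target places none, the Wasserstein penalty lowers the effective reward of that transition, steering the learned policy toward the safe detour as `lam` grows.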