Reinforcement learning (RL) policies may exhibit unsafe behavior and are hard to explain. We use counterfactual reasoning with large language models to enhance the safety of RL policies after training. We show that our approach both improves the safety of the RL policy and helps to explain it.