Safe reinforcement learning (RL) agents accomplish given tasks while adhering to specific constraints. Expressing constraints in easily understandable human language holds considerable potential for real-world applications because it is accessible and does not rely on domain expertise. Previous safe RL methods with natural language constraints typically adopt a recurrent neural network, which limits their ability to handle varied forms of human language input. Furthermore, these methods often require a ground-truth cost function, so domain expertise is needed to convert language constraints into a well-defined cost function that determines constraint violation. To address these issues, we propose to use pre-trained language models (LMs) to facilitate RL agents' comprehension of natural language constraints and allow them to infer costs for safe policy learning. By using pre-trained LMs and eliminating the need for a ground-truth cost, our method enhances safe policy learning under a diverse set of human-derived, free-form natural language constraints. Experiments on grid-world navigation and robot control show that the proposed method achieves strong performance while adhering to the given constraints. The pre-trained LMs allow our method to comprehend complicated constraints and learn safe policies without a ground-truth cost at any stage of training or evaluation. Extensive ablation studies demonstrate the efficacy of each part of our method.