Safe reinforcement learning (RL) agents accomplish given tasks while adhering to specific constraints. Expressing constraints in easily understandable human language offers considerable potential for real-world applications because it is accessible and does not rely on domain expertise. Previous safe RL methods with natural language constraints typically adopt a recurrent neural network, which limits their ability to handle diverse forms of human language input. Furthermore, these methods often require a ground-truth cost function, so domain expertise is needed to convert language constraints into a well-defined cost function that determines constraint violation. To address these issues, we propose to use pre-trained language models (LMs) to help RL agents comprehend natural language constraints and infer costs for safe policy learning. By using pre-trained LMs and eliminating the need for a ground-truth cost, our method enhances safe policy learning under a diverse set of human-derived, free-form natural language constraints. Experiments on grid-world navigation and robot control show that the proposed method achieves strong performance while adhering to the given constraints. The use of pre-trained LMs allows our method to comprehend complicated constraints and learn safe policies without a ground-truth cost at any stage of training or evaluation. Extensive ablation studies demonstrate the efficacy of each part of our method.
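To make the idea of inferring costs from language concrete, the following is a minimal sketch, not the method proposed here. It assumes a textual description of each observation is available, and it replaces the pre-trained LM with a toy bag-of-words encoder so that the example runs without external dependencies. The names `embed`, `infer_cost`, `penalized_reward`, the similarity threshold, and the Lagrangian-style penalty are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in (assumption) for a pre-trained LM encoder: word counts."""
    cleaned = text.lower().replace(".", " ").replace(",", " ")
    return Counter(cleaned.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def infer_cost(constraint: str, observation: str, threshold: float = 0.2) -> float:
    """Predict a binary cost without a ground-truth cost function.

    Hypothetical rule: the step is flagged as a violation (cost 1.0) when the
    observation text is similar enough to the constraint text.
    """
    similarity = cosine(embed(constraint), embed(observation))
    return 1.0 if similarity >= threshold else 0.0


def penalized_reward(reward: float, cost: float, lam: float = 1.0) -> float:
    """Lagrangian-style shaping (assumption): subtract the weighted predicted cost."""
    return reward - lam * cost


if __name__ == "__main__":
    constraint = "Do not touch the lava."
    for obs in ("agent steps on lava", "agent walks on grass"):
        c = infer_cost(constraint, obs)
        print(obs, "| cost:", c, "| shaped reward:", penalized_reward(1.0, c))
```

In a realistic setting, `embed` would be replaced by an actual pre-trained LM encoder and the thresholding rule by a learned cost predictor; the sketch only shows where a predicted cost could enter the policy's learning signal in place of a ground-truth cost.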