Safe reinforcement learning (RL) requires the agent to complete a given task while obeying specific constraints. Specifying constraints in natural language has great potential in practical scenarios because it is accessible and transfers flexibly across tasks. Previous safe RL methods with natural language constraints typically require a manually designed cost function for each constraint, which demands domain expertise and lacks flexibility. In this paper, we exploit the dual role of text in this task, using it not only to specify constraints but also as a training signal. We introduce the Trajectory-level Textual Constraints Translator (TTCT) to replace the manually designed cost function. Our empirical results demonstrate that TTCT effectively comprehends textual constraints and trajectories, and that policies trained with TTCT achieve a lower violation rate than policies trained with the standard cost function. Further studies show that TTCT has zero-shot transfer capability, allowing it to adapt to environments with shifted constraints.
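The toy sketch below illustrates only the interface change the abstract describes. It is not TTCT. The baseline is a hand-written cost function tied to one specific constraint. The alternative is a single function that takes the constraint text together with the whole trajectory and returns a trajectory-level cost. Everything in the sketch is an assumption made for illustration: trajectories are lists of visited cell types, the names `manual_lava_cost`, `toy_translator`, `bind_constraint`, and `violation_rate` are hypothetical, and the keyword matcher merely stands in for a learned model. The violation rate is taken here to be the fraction of trajectories with positive cost.

```python
from typing import Callable, List, Sequence

# Toy trajectory: the sequence of cell types the agent visited (an assumption).
Trajectory = Sequence[str]

# Interface of a text-conditioned, trajectory-level cost:
# (constraint text, whole trajectory) -> scalar cost.
TextualCost = Callable[[str, Trajectory], float]


def manual_lava_cost(traj: Trajectory) -> float:
    """Hand-written cost for one specific constraint: 'Do not touch lava.'"""
    return 1.0 if "lava" in traj else 0.0


def toy_translator(constraint: str, traj: Trajectory) -> float:
    """Hypothetical stand-in for a learned translator.

    Treats every word of the constraint as a forbidden cell type and returns 1.0
    if the trajectory visits any of them. A real translator would be a trained
    model; this keyword match only mimics the (text, trajectory) -> cost interface.
    """
    forbidden = {w.strip(".,!").lower() for w in constraint.split()}
    return 1.0 if any(cell in forbidden for cell in traj) else 0.0


def bind_constraint(constraint: str, translator: TextualCost) -> Callable[[Trajectory], float]:
    """Turn one translator plus one constraint sentence into a per-constraint cost."""
    return lambda traj: translator(constraint, traj)


def violation_rate(trajs: List[Trajectory], cost: Callable[[Trajectory], float]) -> float:
    """Fraction of trajectories whose trajectory-level cost is positive (assumed definition)."""
    if not trajs:
        raise ValueError("need at least one trajectory")
    return sum(cost(t) > 0 for t in trajs) / len(trajs)


if __name__ == "__main__":
    trajs = [["grass", "lava"], ["grass", "water"], ["water", "water"]]
    # The hand-written cost only covers the lava constraint (1 of 3 trajectories violates it).
    print(violation_rate(trajs, manual_lava_cost))
    # A new constraint needs new text, not new code (2 of 3 trajectories violate it).
    print(violation_rate(trajs, bind_constraint("Do not touch water.", toy_translator)))
```

The design point is that the translator's signature does not change when the constraint changes. This is what removes the per-constraint engineering step that the abstract attributes to previous methods.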