Specification-guided reinforcement learning (RL) provides a principled framework for encoding complex, temporally extended tasks using formal specifications such as linear temporal logic (LTL). While recent methods have shown promising results, their ability to generalize across unseen specifications and diverse environments remains insufficiently understood. In this work, we introduce SpecRLBench, a benchmark designed to evaluate the generalization capabilities of LTL-based specification-guided RL methods. The benchmark spans multiple difficulty levels across navigation and manipulation domains, incorporating both static and dynamic environments, diverse robot dynamics, and varied observation modalities. Through extensive empirical evaluation, we characterize the strengths and limitations of existing approaches and reveal the challenges that emerge as specification and environment complexity increase. SpecRLBench provides a structured platform for systematic comparison and supports the development of more generalizable specification-guided RL methods. Code is available at https://github.com/BU-DEPEND-Lab/SpecRLBench.