In many real-world applications, the safety constraints that a reinforcement learning (RL) algorithm must respect are either unknown or not explicitly defined. We propose a framework that concurrently learns safety constraints and optimal RL policies in such environments, supported by theoretical guarantees. Our approach combines a logically constrained RL algorithm with an evolutionary algorithm to synthesize signal temporal logic (STL) specifications. The framework is underpinned by theorems that establish the convergence of the joint learning process and bound the error between the discovered policy and the true optimal policy. We demonstrate the framework in grid-world environments, where it identifies both acceptable safety constraints and RL policies, and where the results illustrate how the theorems apply in practice.
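To make the notion of an STL safety specification concrete, the following minimal sketch scores a grid-world trajectory against one candidate constraint, "always stay more than `margin` cells away from a hazard", using the standard quantitative (robustness) semantics of the "always" operator: the minimum over time of the predicate's margin. The function names, the Manhattan-distance predicate, and the example trajectories are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, column) position in a grid world


def manhattan(a: Cell, b: Cell) -> int:
    """Manhattan distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def robustness_always_avoid(trajectory: List[Cell], hazard: Cell, margin: int) -> int:
    """Robustness of the STL formula G (dist(state, hazard) > margin).

    Under the usual quantitative semantics, "always" takes the minimum of
    the predicate's signed margin over every step of the finite trace.
    A positive value means the trace satisfies the constraint with slack;
    a negative value means it is violated at some step.
    """
    return min(manhattan(state, hazard) - margin for state in trajectory)


# Hypothetical example data: one hazard cell and two candidate trajectories.
hazard = (3, 3)
safe_path = [(0, 0), (0, 1), (0, 2), (1, 2)]
risky_path = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]

safe_score = robustness_always_avoid(safe_path, hazard, margin=1)
risky_score = robustness_always_avoid(risky_path, hazard, margin=1)
print(safe_score, risky_score)
```

A score of this kind is one plausible way to compare candidate specifications or to shape a reward signal; how the framework itself evaluates candidates is not stated in the abstract.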