Research on safe Reinforcement Learning (RL), which aims to promote the safe real-world deployment of RL, has made significant progress in recent years. However, most existing works still focus on the online setting, where risky violations of the safety budget are likely to occur during training. Moreover, in many real-world applications, the learned policy must respond in real time to dynamically determined safety budgets (i.e., constraint thresholds). In this paper, we target this real-time budget constraint problem under the offline setting and propose Trajectory-based REal-time Budget Inference (TREBI), a novel solution that approaches the problem from the perspective of trajectory distribution. Theoretically, we prove an error bound on the estimation of the episodic reward and cost under the offline setting, and thus provide a performance guarantee for TREBI. Empirical results on a wide range of simulation tasks and a real-world large-scale advertising application demonstrate the capability of TREBI in solving real-time budget constraint problems under offline settings.