Reinforcement Learning (RL) has achieved remarkable success in safety-critical areas, but it remains vulnerable to adversarial attacks. Recent studies have introduced "smoothed policies" to enhance its robustness. Yet it is still challenging to establish a provable guarantee that certifies a bound on the total reward of such policies. Prior methods relied primarily on computing bounds via Lipschitz continuity or on calculating the probability that the cumulative reward exceeds specific thresholds. However, these techniques are only suited for continuous perturbations on the RL agent's observations and are restricted to perturbations bounded by the $l_2$-norm. To address these limitations, this paper proposes a general black-box certification method capable of directly certifying the cumulative reward of the smoothed policy under various $l_p$-norm bounded perturbations. Furthermore, we extend our methodology to certify perturbations on action spaces. Our approach leverages $f$-divergence to measure the discrepancy between the original distribution and the perturbed distribution, and then determines the certification bound by solving a convex optimisation problem. We provide a comprehensive theoretical analysis and conduct extensive experiments in multiple environments. Our results show that our method not only improves the certified lower bound of the mean cumulative reward but is also more efficient than state-of-the-art techniques.
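To make the idea of bounding a reward through a divergence constraint concrete, here is a minimal sketch. It is not the paper's algorithm. It assumes the KL divergence as one particular $f$-divergence, uses the Donsker-Varadhan inequality $\mathbb{E}_Q[R] \ge -\lambda \log \mathbb{E}_P[e^{-R/\lambda}] - \lambda\,\epsilon$ (valid for any $\lambda > 0$ whenever $\mathrm{KL}(Q \,\|\, P) \le \epsilon$), and replaces the expectation under $P$ with an empirical mean, ignoring the statistical error a real certificate would need to account for. The function names `kl_certified_lower_bound` and `gaussian_kl_budget`, and the choice of isotropic Gaussian smoothing noise, are hypothetical and chosen only for illustration.

```python
import numpy as np


def gaussian_kl_budget(eps, sigma, num_steps=1):
    """KL divergence between N(x, sigma^2 I) and N(x + delta, sigma^2 I)
    with ||delta||_2 <= eps, summed over num_steps independent noise draws.
    Assumption: isotropic Gaussian smoothing noise (illustrative only)."""
    return num_steps * eps ** 2 / (2.0 * sigma ** 2)


def kl_certified_lower_bound(returns, kl_budget, lambdas=None):
    """Lower-bound the perturbed mean return E_Q[R] from clean samples of R.

    Every lambda > 0 yields a valid bound, so the search over lambda is a
    one-dimensional concave maximisation; a log-spaced grid is used here
    for simplicity instead of an exact solver.
    """
    returns = np.asarray(returns, dtype=float)
    if lambdas is None:
        lambdas = np.logspace(-3, 3, 400)
    best = -np.inf
    for lam in lambdas:
        z = -returns / lam
        m = z.max()
        # Numerically stable log of the empirical mean of exp(-R / lambda).
        log_mgf = m + np.log(np.mean(np.exp(z - m)))
        best = max(best, -lam * log_mgf - lam * kl_budget)
    return best


if __name__ == "__main__":
    # Synthetic returns in [0, 1]; these are placeholder data, not results.
    rng = np.random.default_rng(0)
    sampled_returns = rng.uniform(0.0, 1.0, size=1000)
    budget = gaussian_kl_budget(eps=0.1, sigma=0.2)
    print("clean mean:", sampled_returns.mean())
    print("certified lower bound:", kl_certified_lower_bound(sampled_returns, budget))
```

With a zero budget the bound approaches the clean empirical mean, and it decreases as the budget grows, which matches the intuition that a larger admissible perturbation permits a larger drop in reward.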