We present a policy optimization framework in which the learned policy comes with a machine-checkable certificate of adversarial robustness. Our approach, called CAROL, learns a model of the environment. In each learning iteration, it uses the current version of this model and an external abstract interpreter to construct a differentiable signal for provable robustness. This signal guides policy learning, and the abstract interpretation used to construct it directly yields the robustness certificate returned at convergence. We give a theoretical analysis that bounds the worst-case cumulative reward of CAROL. We also evaluate CAROL experimentally on four MuJoCo environments. On these tasks, which involve continuous state and action spaces, CAROL learns certified policies whose performance is comparable to that of the (non-certified) policies learned using state-of-the-art robust RL methods.
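To make the role of the abstract interpreter concrete, the following is a minimal sketch of interval bound propagation through a small ReLU policy network. It is an illustration only, not CAROL's implementation: the abstract does not say which abstract domain or network architecture is used, so the interval domain, the two-layer network, the random weights, and all function names (`interval_affine`, `interval_relu`, `policy_bounds`, `policy`) are assumptions. Given a state and an L-infinity perturbation budget `eps`, the sketch computes sound lower and upper bounds on the policy's output over every perturbed state.

```python
import numpy as np


def interval_affine(lo, hi, W, b):
    """Propagate the box [lo, hi] through the map x -> W @ x + b."""
    center = (lo + hi) / 2.0
    radius = (hi - lo) / 2.0
    new_center = W @ center + b
    # |W| @ radius is the largest deviation each output can have from its center.
    new_radius = np.abs(W) @ radius
    return new_center - new_radius, new_center + new_radius


def interval_relu(lo, hi):
    """ReLU is monotone, so it can be applied to each bound separately."""
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)


def policy_bounds(state, eps, params):
    """Sound output bounds for all states within an L-infinity ball of radius eps."""
    lo, hi = state - eps, state + eps
    for i, (W, b) in enumerate(params):
        lo, hi = interval_affine(lo, hi, W, b)
        if i < len(params) - 1:  # no activation after the last layer
            lo, hi = interval_relu(lo, hi)
    return lo, hi


def policy(state, params):
    """Concrete (non-abstract) forward pass of the same network."""
    x = state
    for i, (W, b) in enumerate(params):
        x = W @ x + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x


# Hypothetical 3-dimensional state, 2-dimensional action, random weights.
rng = np.random.default_rng(0)
params = [
    (rng.normal(size=(8, 3)), rng.normal(size=8)),
    (rng.normal(size=(2, 8)), rng.normal(size=2)),
]
state = np.array([0.1, -0.2, 0.3])
eps = 0.05

lo, hi = policy_bounds(state, eps, params)
print("action lower bounds:", lo)
print("action upper bounds:", hi)
```

Every operation here is built from additions, matrix products, absolute values, and maxima, so the same arithmetic written in an automatic differentiation framework would be differentiable almost everywhere with respect to the weights. That is the sense in which bounds of this kind can act as a training signal. In a model-based setting, one would presumably propagate such bounds through the learned environment model as well, but that step is not shown here.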