We propose FLARE, the first fingerprinting mechanism for verifying whether a suspected Deep Reinforcement Learning (DRL) policy is an illegitimate copy of another (victim) policy. We first show that it is possible to find non-transferable, universal adversarial masks, i.e., perturbations, that generate adversarial examples which transfer successfully from a victim policy to its modified versions but not to independently trained policies. FLARE uses these masks as fingerprints to verify the true ownership of stolen DRL policies by measuring an action agreement value over states perturbed by the masks. Our empirical evaluations show that FLARE is effective (100% action agreement on stolen copies) and does not falsely accuse independent policies (no false positives). FLARE is also robust to model modification attacks and cannot be easily evaded by more informed adversaries without degrading agent performance. We also show that not all universal adversarial masks are suitable fingerprint candidates, owing to the inherent characteristics of DRL policies: the spatio-temporal dynamics of DRL problems and the sequential decision-making process make it harder both to characterize the decision boundary of a DRL policy and to search for universal masks that capture its geometry.
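To make the action agreement idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that action agreement is the fraction of mask-perturbed states on which the suspected policy selects the same action as the victim policy, that policies are deterministic callables from a state to a discrete action index, and that states are flat vectors with values in [0, 1]. The names `apply_mask`, `action_agreement`, and `make_linear_policy` are hypothetical, and the toy linear policies merely stand in for trained DRL agents. How a mask is found and which agreement threshold triggers an ownership claim are not covered here.

```python
from typing import Callable, List, Sequence

State = Sequence[float]
Policy = Callable[[State], int]


def apply_mask(state: State, mask: Sequence[float],
               low: float = 0.0, high: float = 1.0) -> List[float]:
    """Add a universal mask to a state and clip to the valid range.

    The additive form and the [low, high] clipping are assumptions.
    """
    if len(state) != len(mask):
        raise ValueError("state and mask must have the same length")
    return [min(high, max(low, s + m)) for s, m in zip(state, mask)]


def action_agreement(victim: Policy, suspect: Policy,
                     states: Sequence[State], mask: Sequence[float]) -> float:
    """Fraction of perturbed states where both policies pick the same action."""
    if not states:
        raise ValueError("need at least one state")
    matches = 0
    for state in states:
        perturbed = apply_mask(state, mask)
        if victim(perturbed) == suspect(perturbed):
            matches += 1
    return matches / len(states)


def make_linear_policy(weights: Sequence[Sequence[float]]) -> Policy:
    """Toy deterministic policy: argmax over per-action linear scores."""
    def act(state: State) -> int:
        scores = [sum(w * x for w, x in zip(row, state)) for row in weights]
        return max(range(len(scores)), key=scores.__getitem__)
    return act


# Hypothetical toy data: a victim, an exact copy, and a few states.
victim = make_linear_policy([[1.0, 0.0], [0.0, 1.0]])
stolen_copy = make_linear_policy([[1.0, 0.0], [0.0, 1.0]])
states = [[0.2, 0.8], [0.9, 0.1], [0.5, 0.4]]
mask = [0.05, -0.05]

agreement = action_agreement(victim, stolen_copy, states, mask)
```

In this sketch an exact copy agrees with the victim on every perturbed state by construction. The abstract's claim is the stronger empirical one: that suitable masks keep agreement high on modified copies while independently trained policies do not follow the victim's perturbed actions.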