We propose FLARE, the first fingerprinting mechanism to verify whether a suspected Deep Reinforcement Learning (DRL) policy is an illegitimate copy of another (victim) policy. We first show that it is possible to find non-transferable, universal adversarial masks, i.e., perturbations, that generate adversarial examples which transfer successfully from a victim policy to its modified versions but not to independently trained policies. FLARE uses these masks as fingerprints to verify the true ownership of stolen DRL policies by measuring an action agreement value over states perturbed with the masks. Our empirical evaluations show that FLARE is effective (100% action agreement on stolen copies) and does not falsely accuse independent policies (no false positives). FLARE is also robust to model modification attacks, and more informed adversaries cannot easily evade it without degrading agent performance. We also show that not all universal adversarial masks are suitable fingerprint candidates, owing to the inherent characteristics of DRL policies. The spatio-temporal dynamics of DRL problems and their sequential decision-making process make it harder both to characterize the decision boundary of DRL policies and to search for universal masks that capture its geometry.
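To make the verification step concrete, the following is a minimal sketch of how an action agreement value could be computed. It assumes that agreement is the fraction of mask-perturbed states on which the suspected policy picks the same action as the victim policy. The names `perturb`, `action_agreement`, and `is_suspected_copy`, the clipping range, and the `threshold` value are hypothetical choices for illustration and are not taken from FLARE itself.

```python
from typing import Callable, Sequence

State = Sequence[float]
Policy = Callable[[State], int]  # maps a state to a discrete action index


def perturb(state: State, mask: Sequence[float],
            low: float = 0.0, high: float = 1.0) -> list:
    """Add a universal mask to a state and clip to the valid range.

    The [low, high] range is an assumption (e.g. normalized pixels).
    """
    return [min(high, max(low, s + m)) for s, m in zip(state, mask)]


def action_agreement(victim: Policy, suspect: Policy,
                     states: Sequence[State], mask: Sequence[float]) -> float:
    """Fraction of perturbed states on which both policies choose the same action."""
    if not states:
        raise ValueError("need at least one state")
    matches = 0
    for state in states:
        adv_state = perturb(state, mask)
        if victim(adv_state) == suspect(adv_state):
            matches += 1
    return matches / len(states)


def is_suspected_copy(victim: Policy, suspect: Policy,
                      states: Sequence[State], mask: Sequence[float],
                      threshold: float = 0.5) -> bool:
    """Flag the suspect when agreement reaches a (hypothetical) threshold."""
    return action_agreement(victim, suspect, states, mask) >= threshold


# Toy example: two-feature states and hand-written policies.
states = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.2]]
mask = [0.3, 0.0]


def victim(s: State) -> int:
    return int(s[0] > 0.35)


def stolen_copy(s: State) -> int:
    return victim(s)  # identical behaviour to the victim


def independent(s: State) -> int:
    return int(s[1] > 0.5)  # decides on a different feature


print(action_agreement(victim, stolen_copy, states, mask))
print(action_agreement(victim, independent, states, mask))
```

In this toy setting the copy agrees with the victim on every perturbed state while the independent policy agrees on none. Real policies would be neural networks evaluated on states collected from the environment, and one would typically aggregate agreement over several masks rather than rely on a single one.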