Transparency and security are both central to Responsible AI, but they may conflict in adversarial settings. We investigate the strategic effect of transparency on agents through the lens of transferable adversarial example attacks, in which attackers maliciously perturb their inputs using surrogate models to fool a defender's target model. Both the surrogate and the target model can be defended or undefended, and each player must decide which kind to use. Using a large-scale empirical evaluation of nine attacks across 181 models, we find that attackers are more successful when their choice matches the defender's; hence, obscurity could be beneficial to the defender. Using game theory, we analyze this trade-off between transparency and security by modeling the problem as both a Nash game and a Stackelberg game and comparing the expected outcomes. Our analysis confirms that merely knowing whether a defender's model is defended can sometimes be enough to damage its security. This result points to a broader trade-off, suggesting that transparency in AI systems can be at odds with their security. Beyond adversarial machine learning, our work illustrates how game-theoretic reasoning can uncover conflicts between transparency and security.
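To make the comparison between the simultaneous (Nash) setting and the transparent setting concrete, here is a minimal sketch, not from the paper: the attacker-success matrix `S` is hypothetical, and transparency is read here as the defender committing to an observable pure defended/undefended choice to which the attacker best-responds, while the Nash value lets the defender mix unobservably.

```python
# Illustrative sketch (hypothetical payoffs, not the paper's data): compare the
# defender's expected attack success under an unobservable mixed strategy
# (Nash / minimax value) vs. a transparent, observable pure commitment.
import numpy as np
from scipy.optimize import linprog

# Hypothetical attacker success rates.
# Rows: defender's target model (0 = undefended, 1 = defended).
# Cols: attacker's surrogate model (0 = undefended, 1 = defended).
# Matching the defender's choice (the diagonal) gives higher success.
S = np.array([[0.8, 0.4],
              [0.3, 0.7]])

def nash_value(S):
    """Zero-sum value: defender mixes over rows to minimize the
    attacker's best-response success rate (solved as an LP)."""
    n_rows, n_cols = S.shape
    c = np.zeros(n_rows + 1)
    c[-1] = 1.0                                       # minimize the value v
    A_ub = np.hstack([S.T, -np.ones((n_cols, 1))])    # S^T p - v <= 0
    b_ub = np.zeros(n_cols)
    A_eq = np.hstack([np.ones((1, n_rows)), np.zeros((1, 1))])  # sum p = 1
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * n_rows + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:-1], res.x[-1]

def transparent_value(S):
    """Defender commits to an observable pure choice; the attacker
    best-responds, so each row yields its worst-case (max) column."""
    worst_per_row = S.max(axis=1)
    row = int(worst_per_row.argmin())
    return row, worst_per_row[row]

p, v_nash = nash_value(S)
row, v_transparent = transparent_value(S)
print(f"Obscure (Nash) value: {v_nash:.2f} with defender mix {p.round(2)}")
print(f"Transparent value:    {v_transparent:.2f} (defender plays row {row})")
# With these hypothetical numbers, transparency raises the expected attack
# success rate (0.70 vs. 0.55), i.e. observability hurts the defender.
```

With these illustrative numbers, the gap between the two values is one way to quantify how much security the defender gives up by revealing whether its model is defended.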