The random subspace method has wide security applications, such as providing certified defenses against adversarial and backdoor attacks and building robustly aligned LLMs against jailbreaking attacks. However, explanations for the random subspace method remain underexplored. Existing state-of-the-art feature attribution methods, such as Shapley value and LIME, are computationally impractical and lack security guarantees when applied to the random subspace method. In this work, we propose EnsembleSHAP, an intrinsically faithful and secure feature attribution method for the random subspace method that reuses its computational byproducts. Specifically, our feature attribution method 1) is computationally efficient, 2) maintains essential properties of effective feature attribution (such as local accuracy), and 3) offers guaranteed protection against explanation-preserving attacks on feature attribution methods. To the best of our knowledge, this is the first work to establish provable robustness against explanation-preserving attacks. We also comprehensively evaluate the effectiveness of our explanations under different empirical attacks, including backdoor attacks, adversarial attacks, and jailbreak attacks. The code is at https://github.com/Wang-Yanting/EnsembleSHAP. WARNING: This document may include content that could be considered harmful.
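To make the idea of reusing computational byproducts concrete, the following is a minimal illustrative sketch, not the authors' implementation (see the repository above for that). It assumes a random subspace ensemble that classifies many random size-`k` feature subsets and takes a majority vote, and it assumes a simple attribution rule in which each vote for a label is split evenly among the `k` features in the sampled subset. The names `random_subspace_predict`, `byproduct_attribution`, and `toy_classifier` are hypothetical. Under these assumptions the scores sum to the label's vote share, which mirrors the local accuracy property mentioned in the abstract, and no base-model queries are needed beyond those the ensemble already made.

```python
import random
from collections import Counter


def random_subspace_predict(base_classifier, features, k, n_samples, seed=0):
    """Majority vote over base-model predictions on random size-k feature subsets.

    Returns the predicted label, the vote counts, and the per-subset records
    (sampled index set, label), which are the byproducts reused for attribution.
    """
    rng = random.Random(seed)
    n = len(features)
    votes = Counter()
    records = []
    for _ in range(n_samples):
        idx = frozenset(rng.sample(range(n), k))
        label = base_classifier({i: features[i] for i in idx})
        votes[label] += 1
        records.append((idx, label))
    predicted = votes.most_common(1)[0][0]
    return predicted, votes, records


def byproduct_attribution(records, n_features, k, label):
    """Assumed attribution rule (illustration only): every subset that voted
    for `label` gives an equal 1/k share of its vote to each feature it contains.
    No additional base-model queries are made."""
    n_samples = len(records)
    scores = [0.0] * n_features
    for idx, predicted_label in records:
        if predicted_label == label:
            for i in idx:
                scores[i] += 1.0 / (k * n_samples)
    return scores


def toy_classifier(subset):
    """Hypothetical base model: flags the input if the token 'attack' is visible."""
    return "harmful" if "attack" in subset.values() else "benign"


tokens = ["please", "attack", "the", "server", "now"]
predicted, votes, records = random_subspace_predict(
    toy_classifier, tokens, k=3, n_samples=200, seed=1
)
scores = byproduct_attribution(records, len(tokens), k=3, label="harmful")
print(predicted, dict(votes))
print([round(s, 3) for s in scores])
```

In this toy setup every subset that votes "harmful" must contain the token "attack", so that token receives the largest score for the "harmful" label, while the remaining tokens only pick up credit from co-occurring with it.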