The field of Explainable Artificial Intelligence (XAI) focuses on techniques for providing end-users with explanations of the decision-making processes underlying modern machine learning (ML) models. Within the vast universe of XAI techniques, counterfactual (CF) explanations are often preferred by end-users because they explain the predictions of ML models by providing an easy-to-understand and actionable recourse (or contrastive) case to individual end-users who are adversely impacted by predicted outcomes. However, recent studies have revealed significant security concerns with using CF explanations in real-world applications; in particular, malicious adversaries can exploit CF explanations to perform query-efficient model extraction attacks on proprietary ML models. In this paper, we propose a model-agnostic framework for watermarking CF explanations that can be leveraged to detect unauthorized model extraction attacks relying on the watermarked CF explanations. Our framework solves a bi-level optimization problem to embed an indistinguishable watermark into the generated CF explanation, such that any future model extraction attack that relies on these watermarked CF explanations can be detected using a null hypothesis significance testing (NHST) scheme, while ensuring that the embedded watermarks do not compromise the quality of the generated CF explanations. We evaluate the framework's performance across a diverse set of real-world datasets, CF explanation methods, and model extraction techniques, and show that our watermark detection system accurately identifies extracted ML models trained on the watermarked CF explanations. Our work paves the way for the secure adoption of CF explanations in real-world applications.
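To make the detection step concrete, below is a minimal sketch of one way an NHST-based check could be realized. The names (`suspect_model`, `watermark_points`, `watermark_labels`) and the choice of a one-sided binomial test against a chance-level agreement rate are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch of NHST-based watermark detection: test whether a
# suspect model agrees with the embedded watermark labels more often than
# would be expected by chance, which would indicate it was trained on
# watermarked CF explanations. All names and the test choice are assumptions.
import numpy as np
from scipy.stats import binomtest

def detect_extraction(suspect_model, watermark_points, watermark_labels,
                      chance_rate=0.5, alpha=0.05):
    """H0: the suspect model's agreement with the watermark labels is at
    chance level (i.e., it was not trained on the watermarked CFs)."""
    preds = suspect_model.predict(np.asarray(watermark_points))
    agreements = int((preds == np.asarray(watermark_labels)).sum())
    result = binomtest(agreements, n=len(watermark_labels),
                       p=chance_rate, alternative="greater")
    # Reject H0 (flag the model as extracted) when the p-value falls
    # below the chosen significance level.
    return result.pvalue < alpha, result.pvalue
```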