Missing data is common in practice, and when the missingness is non-ignorable, effective remediation depends on knowledge of the missingness mechanism. Because the underlying missingness mechanism cannot, in general, be learned from the data, adversaries can exploit this by maliciously engineering non-ignorable missingness mechanisms. Such Adversarial Missingness (AM) attacks have only recently been motivated and introduced, and then successfully tailored to mislead causal structure learning algorithms into hiding specific cause-and-effect relationships. However, existing AM attacks assume the modeler (victim) uses full-information maximum likelihood methods to handle the missing data, and are of limited applicability when the modeler uses a different remediation strategy. In this work we focus on associational learning in the context of AM attacks. We consider three alternative strategies the modeler may use: (i) complete case analysis, (ii) mean imputation, and (iii) regression-based imputation. Instead of combinatorially searching over which entries to make missing, we propose a novel probabilistic approximation obtained by deriving the asymptotic forms of these missing-data handling methods. We then formulate the learning of the adversarial missingness mechanism as a bi-level optimization problem. Experiments on generalized linear models show that AM attacks can render statistically significant features insignificant, as judged by their p-values, in real datasets such as the California housing dataset, while using relatively moderate amounts of missingness (<20%). Additionally, we assess the robustness of our attacks against defense strategies based on data valuation.
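To make the three remediation strategies concrete, the following is a minimal sketch on synthetic data. It is not the attack proposed in this work: the missingness mechanism here is hand-crafted rather than learned through bi-level optimization, and the data-generating model, the constants 0.3 and 3.0, and the helper names `ols` and `run_demo` are all illustrative assumptions. The regression imputation step predicts the masked feature from the other covariate only, which is one of several possible variants. The sketch only shows how a mechanism that depends on the masked value itself can shift the coefficients a modeler would estimate under each strategy.

```python
import numpy as np


def ols(X, y):
    """Least-squares fit of y on [1, X]; returns (intercept, slope_1, ...)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def run_demo(n=20_000, seed=0):
    rng = np.random.default_rng(seed)

    # Hypothetical ground truth (an assumption): y = x1 + x2 + noise.
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = x1 + x2 + rng.normal(size=n)

    # Hand-crafted non-ignorable mechanism (an assumption, not a learned attack):
    # x1 is masked more often when x1 * y is large, so the masking probability
    # depends on the very value that ends up missing.
    p_missing = 0.3 / (1.0 + np.exp(-3.0 * x1 * y))
    missing = rng.random(n) < p_missing
    obs = ~missing

    # Reference fit on the fully observed data.
    coefs = {"full data": ols(np.column_stack([x1, x2]), y)}

    # (i) Complete case analysis: drop every row in which x1 is missing.
    coefs["complete case"] = ols(np.column_stack([x1[obs], x2[obs]]), y[obs])

    # (ii) Mean imputation: fill missing x1 with the mean of the observed x1.
    x1_mean = np.where(missing, x1[obs].mean(), x1)
    coefs["mean imputation"] = ols(np.column_stack([x1_mean, x2]), y)

    # (iii) Regression imputation: predict x1 from x2, fitted on observed rows.
    a, b = ols(x2[obs], x1[obs])
    x1_reg = np.where(missing, a + b * x2, x1)
    coefs["regression imputation"] = ols(np.column_stack([x1_reg, x2]), y)

    return missing.mean(), coefs


rate, coefs = run_demo()
print(f"fraction of x1 masked: {rate:.3f}")
for name, c in coefs.items():
    print(f"{name:>22}: x1 coef = {c[1]:.3f}, x2 coef = {c[2]:.3f}")
```

Comparing each row of the printout against the "full data" row shows how far each strategy's estimate moves under this particular mechanism. Assessing significance, as in the experiments described above, would additionally require standard errors and p-values, which this sketch does not compute.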