Learned indexes are a class of index data structures that enable fast search by approximating the cumulative distribution function (CDF) of the keys using machine learning models (Kraska et al., SIGMOD'18). However, recent studies have shown that learned indexes are vulnerable to poisoning attacks, where injecting a small number of poison keys into the training data can significantly degrade model accuracy and reduce index performance (Kornaropoulos et al., SIGMOD'22). In this work, we provide a rigorous theoretical analysis of poisoning attacks targeting linear regression models over CDFs, one of the most basic regression models and a core component in many learned indexes. Our main contributions are as follows: (i) We present a theoretical proof characterizing the optimal single-point poisoning attack and show that the existing method yields the optimal attack. (ii) We show that in multi-point attacks, the existing greedy approach is not always optimal, and we rigorously derive the key properties that an optimal attack should satisfy. (iii) We propose a method to compute an upper bound on the impact of multi-point poisoning attacks and empirically demonstrate that the loss under the greedy approach is often close to this bound. Our study deepens the theoretical understanding of attack strategies against linear regression models on CDFs and provides a foundation for the theoretical evaluation of attacks and defenses on learned indexes.
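As an illustration of the setting, the following is a minimal sketch of a poisoning attack on a linear regression fitted to key ranks. It makes several assumptions that are not stated in the abstract: keys are distinct integers, the CDF is represented by ranks 1..n of the sorted keys, the loss is the mean squared error of the least-squares fit, and poison keys must fall strictly inside the existing key range. The names `fit_mse`, `best_single_poison`, and `greedy_poison` are hypothetical, and the exhaustive candidate search is a simple stand-in for illustration, not the method analyzed in the paper.

```python
def fit_mse(keys):
    """Fit rank ~ w * key + b by least squares and return the mean squared error.

    Ranks are 1..n over the sorted keys (an assumed stand-in for the CDF).
    """
    xs = sorted(keys)
    n = len(xs)
    ranks = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_r = (n + 1) / 2
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_r = sum((r - mean_r) ** 2 for r in ranks) / n
    cov = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, ranks)) / n
    # For simple linear regression: MSE = Var(r) - Cov(x, r)^2 / Var(x).
    return var_r - cov * cov / var_x


def best_single_poison(keys):
    """Try every unused integer key strictly inside the key range and return
    the (key, loss) pair that maximizes the MSE after refitting."""
    existing = set(keys)
    best_key, best_loss = None, float("-inf")
    for p in range(min(keys) + 1, max(keys)):
        if p in existing:
            continue
        # Inserting p shifts the rank of every larger key by one;
        # fit_mse recomputes ranks from the sorted, poisoned key set.
        loss = fit_mse(list(keys) + [p])
        if loss > best_loss:
            best_key, best_loss = p, loss
    return best_key, best_loss


def greedy_poison(keys, budget):
    """Assumed form of a greedy multi-point attack: repeatedly insert the
    single poison key that is best for the current (already poisoned) set."""
    poisoned = list(keys)
    chosen = []
    for _ in range(budget):
        p, _loss = best_single_poison(poisoned)
        if p is None:  # no free slot left inside the key range
            break
        poisoned.append(p)
        chosen.append(p)
    return chosen, fit_mse(poisoned)
```

For example, `best_single_poison([10, 20, 30, 40])` starts from keys that a line fits exactly and returns the in-range key whose insertion increases the refitted error the most. The greedy variant reuses this single-point search at every step, which is why it can miss a jointly better set of poison keys.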