Following the widespread adoption of machine learning models in real-world applications, the phenomenon of performativity, i.e. model-dependent shifts in the test distribution, is becoming increasingly prevalent. Unfortunately, since models are usually trained solely on samples from the original (unshifted) distribution, this performative shift may lead to decreased test-time performance. In this paper, we study whether and when performative binary classification problems are learnable, through the lens of the classic PAC (Probably Approximately Correct) learning framework. We motivate several performative scenarios, accounting in particular for linear shifts in the label distribution, as well as for more general changes in both the labels and the features. We construct a performative empirical risk function that depends only on data from the original distribution and on the type of performative effect, yet is an unbiased estimate of the true risk of a classifier on the shifted distribution. Minimizing this performative risk allows us to show that any hypothesis space that is PAC-learnable in the standard binary classification setting remains PAC-learnable in the considered performative scenarios. We also conduct an extensive experimental evaluation of our performative risk minimization method and showcase its benefits on synthetic and real data.
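As a minimal illustration of how a risk on the shifted distribution can be estimated from unshifted data, consider the following toy sketch. The shift model here is an assumption made purely for illustration and is not the construction used in the paper: whenever the classifier predicts 1, the true label is flipped with a known probability `flip`; otherwise the label is unchanged. Under that assumption, the expected 0-1 loss on the shifted label can be written in terms of the original label, so averaging it over original samples gives an unbiased estimate of the shifted risk. The function names and the parameter `flip` are hypothetical.

```python
import random


def performative_empirical_risk(preds, labels, flip):
    """Average expected 0-1 loss under the ASSUMED shift, using only original labels.

    Assumed toy shift (not from the paper): if the prediction is 1, the label
    is flipped with probability `flip`; if the prediction is 0, it is unchanged.
    """
    total = 0.0
    for p, y in zip(preds, labels):
        if p == 1:
            # Shifted label equals y w.p. (1 - flip) and 1 - y w.p. flip,
            # so the expected loss of predicting 1 mixes both outcomes.
            total += (1 - flip) * (1 - y) + flip * y
        else:
            # Negative predictions leave the label untouched.
            total += float(y != 0)
    return total / len(labels)


def simulate_shifted_risk(preds, labels, flip, rng):
    """Monte Carlo check: sample shifted labels and measure the realized 0-1 loss."""
    errors = 0
    for p, y in zip(preds, labels):
        y_shifted = 1 - y if (p == 1 and rng.random() < flip) else y
        errors += int(p != y_shifted)
    return errors / len(labels)


if __name__ == "__main__":
    rng = random.Random(0)
    preds = [1, 1, 1, 0] * 5000
    labels = [1, 1, 0, 0] * 5000
    print("estimate from original data:", performative_empirical_risk(preds, labels, 0.5))
    print("simulated shifted risk:     ", simulate_shifted_risk(preds, labels, 0.5, rng))
```

With `flip = 0` the estimate reduces to the standard empirical 0-1 risk, which is a quick sanity check on the reweighting.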