A Model-free Closeness-of-influence Test for Features in Supervised Learning

Understanding the effect of a feature vector $x \in \mathbb{R}^d$ on the response value (label) $y \in \mathbb{R}$ is the cornerstone of many statistical learning problems. Ideally, one would like to understand how a set of collected features combine and jointly influence the response value, but this problem is notoriously difficult because of the high dimensionality of the data and the limited number of labeled data points, among other reasons. In this work, we take a new perspective on this problem and study the question of assessing the difference in influence that two given features have on the response value. We first propose a notion of closeness for the influence of features, and show that our definition recovers the familiar notion of the magnitude of coefficients in the parametric model. We then propose a novel method to test for closeness of influence in general model-free supervised learning problems. Our proposed test can be used with a finite number of samples and controls the type I error rate, no matter the ground-truth conditional law $\mathcal{L}(Y |X)$. We analyze the power of our test for two general learning problems, (i) linear regression and (ii) binary classification under mixture-of-Gaussians models, and show that, with a proper choice of the score function (an internal component of our test) and a sufficient number of samples, the test achieves full statistical power. We evaluate our findings through extensive numerical simulations. Specifically, we adopt the datamodel framework (Ilyas et al., 2022) for the CIFAR-10 dataset to identify pairs of training samples with different influence on the trained model via optional black box training mechanisms.
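To make the idea of a score-based, finite-sample test concrete, the following is a minimal sketch and not the procedure proposed in this work, which the abstract does not spell out. It assumes a simplified setting: features $i$ and $j$ are exchangeable (the law of $X$ is unchanged when the two coordinates are swapped), the null hypothesis is that swapping them leaves $\mathcal{L}(Y|X)$ unchanged, and the score function is fixed independently of the test sample. Under these assumptions the difference in score between a sample and its swapped copy is symmetric about zero, so an exact sign test controls the type I error for any sample size. The names `swap`, `sign_test_pvalue`, `closeness_test`, and `score`, as well as the toy data, are hypothetical.

```python
import math
import random


def swap(x, i, j):
    """Return a copy of feature vector x with coordinates i and j exchanged."""
    x = list(x)
    x[i], x[j] = x[j], x[i]
    return x


def sign_test_pvalue(wins, n):
    """Exact two-sided p-value for `wins` successes under Binomial(n, 1/2)."""
    if n == 0:
        return 1.0
    k = min(wins, n - wins)
    tail = sum(math.comb(n, t) for t in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def closeness_test(xs, ys, i, j, score):
    """Sign test on score(x, y) - score(swap(x), y); ties are discarded.

    Assumption (not from the source): features i and j are exchangeable and
    `score` does not depend on the samples being tested.
    """
    wins = losses = 0
    for x, y in zip(xs, ys):
        d = score(x, y) - score(swap(x, i, j), y)
        if d > 0:
            wins += 1
        elif d < 0:
            losses += 1
    return sign_test_pvalue(wins, wins + losses)


# Hypothetical toy data: y depends on feature 0 only, so features 0 and 1
# have clearly different influence on the response.
random.seed(0)
xs = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(500)]
ys = [3 * x[0] + random.gauss(0, 1) for x in xs]


def score(x, y):
    """Hypothetical fixed score: negative squared error of a linear predictor."""
    return -(y - 3 * x[0]) ** 2


p_alt = closeness_test(xs, ys, 0, 1, score)
print(p_alt)
```

A small p-value is evidence that the two features do not influence the response symmetrically. In this sketch the power depends on how well the score separates a sample from its swapped copy, which mirrors the role the abstract assigns to the score function.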
