Given a sample of size $N$, it is often useful to select a subsample of smaller size $n<N$ to be used for statistical estimation or learning. Such a data selection step reduces both the data labeling requirements and the computational complexity of learning. We assume that we are given $N$ unlabeled samples $\{{\boldsymbol x}_i\}_{i\le N}$, together with access to a `surrogate model' that can predict the labels $y_i$ better than random guessing. Our goal is to select a subset of the samples, denoted by $\{{\boldsymbol x}_i\}_{i\in G}$, of size $|G|=n<N$. We then acquire labels for this subset and use them to train a model via regularized empirical risk minimization. Using a mixture of numerical experiments on real and synthetic data, and mathematical derivations under low- and high-dimensional asymptotics, we show that: $(i)$~Data selection can be very effective, in some cases even beating training on the full sample; $(ii)$~Certain popular choices in data selection methods (e.g., unbiased reweighted subsampling, or influence function-based subsampling) can be substantially suboptimal.
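The pipeline described above (score unlabeled samples with a surrogate, pick a subset $G$, acquire labels only for $G$, then fit by regularized empirical risk minimization) can be sketched as follows. This is only an illustrative sketch and not the selection scheme studied in the paper: the uncertainty-based score, the synthetic logistic data, the noisy surrogate, and the function names `select_subset` and `fit_ridge_logistic` are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def select_subset(X, surrogate_theta, n, rng):
    """Sample n distinct indices with probability proportional to a surrogate-based score."""
    p = sigmoid(X @ surrogate_theta)        # surrogate's predicted P(y = 1 | x)
    score = p * (1.0 - p) + 1e-12           # assumed score: surrogate uncertainty
    probs = score / score.sum()
    return rng.choice(len(X), size=n, replace=False, p=probs)

def fit_ridge_logistic(X, y, lam=1e-2, steps=500, lr=0.5):
    """Ridge-regularized logistic regression fitted by plain gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y) + lam * theta
        theta -= lr * grad
    return theta

rng = np.random.default_rng(0)
N, d, n = 2000, 5, 200

# Synthetic setup (assumption): logistic labels from an unknown true parameter.
theta_true = rng.normal(size=d)
X = rng.normal(size=(N, d))

# Assumed surrogate: a noisy copy of the true parameter, so it beats random guessing.
surrogate_theta = theta_true + 0.5 * rng.normal(size=d)

# Select G with |G| = n, then "acquire" labels for the selected samples only.
G = select_subset(X, surrogate_theta, n, rng)
y_G = (rng.random(n) < sigmoid(X[G] @ theta_true)).astype(float)

# Train on the labeled subset via regularized empirical risk minimization.
theta_hat = fit_ridge_logistic(X[G], y_G)
```

Swapping the score inside `select_subset` (for instance, a constant score gives uniform subsampling) is one way to compare selection rules on the same data; the sketch makes no claim about which rule performs best.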