Given a sample of size $N$, it is often useful to select a subsample of smaller size $n<N$ to be used for statistical estimation or learning. Such a data selection step reduces both the data-labeling requirements and the computational complexity of learning. We assume that we are given $N$ unlabeled samples $\{{\boldsymbol x}_i\}_{i\le N}$, together with access to a `surrogate model' that can predict the labels $y_i$ better than random guessing. Our goal is to select a subset of the samples, denoted by $\{{\boldsymbol x}_i\}_{i\in G}$, of size $|G|=n<N$. We then acquire labels for this subset and use them to train a model via regularized empirical risk minimization. Using a mixture of numerical experiments on real and synthetic data, and mathematical derivations under low- and high-dimensional asymptotics, we show that: $(i)$~Data selection can be very effective, in some cases even beating training on the full sample; $(ii)$~Certain popular choices in data selection methods (e.g., unbiased reweighted subsampling or influence function-based subsampling) can be substantially suboptimal.
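The pipeline described above (score unlabeled samples with a surrogate model, select a subset $G$ of size $n$, acquire labels only for $G$, then fit by regularized empirical risk minimization) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the selection scheme analyzed in the paper: the synthetic logistic data, the noisy `surrogate_theta`, the uncertainty score $p(1-p)$, and the helper names `select_subset` and `fit_regularized_erm` are all hypothetical choices made here for concreteness.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def select_subset(X, surrogate_theta, n):
    """Return indices G with |G| = n.

    Assumed scoring rule (illustrative only): keep the n samples on which
    the surrogate is least certain, measured by p * (1 - p).
    """
    p = sigmoid(X @ surrogate_theta)
    uncertainty = p * (1.0 - p)
    return np.argsort(-uncertainty)[:n]


def fit_regularized_erm(X, y, lam=1e-2, lr=0.5, steps=500):
    """Ridge-regularized logistic regression fitted by plain gradient descent."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ theta) - y) / n + lam * theta
        theta -= lr * grad
    return theta


# Synthetic setup (assumption): Gaussian features, logistic labels.
rng = np.random.default_rng(0)
N, n, d = 2000, 200, 5
theta_true = rng.normal(size=d)
X = rng.normal(size=(N, d))

# Imperfect surrogate (assumption): the true parameter plus noise.
surrogate_theta = theta_true + 0.5 * rng.normal(size=d)

# Select G using the unlabeled samples and the surrogate only.
G = select_subset(X, surrogate_theta, n)

# Labels are acquired only for the selected subset.
y_G = (rng.random(n) < sigmoid(X[G] @ theta_true)).astype(float)

# Train on the n labeled samples.
theta_hat = fit_regularized_erm(X[G], y_G)
```

Swapping `select_subset` for a different rule (for instance, random sampling with reweighting) leaves the rest of the pipeline unchanged, which makes it straightforward to compare selection schemes on the same data.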