Modern deep learning relies heavily on large labeled datasets, which often come with high costs in terms of both manual labeling and computational resources. To mitigate these challenges, researchers have explored informative subset selection techniques, including coreset selection and active learning. Specifically, coreset selection involves sampling data with both input ($\bx$) and output ($\by$), whereas active learning focuses solely on the input data ($\bx$). In this study, we present a theoretically optimal solution for both coreset selection and active learning in the context of linear softmax regression. Our proposed method, COPS (unCertainty based OPtimal Sub-sampling), is designed to minimize the expected loss of a model trained on subsampled data. Unlike existing approaches that rely on explicitly computing the inverse covariance matrix, which is not easily applicable to deep learning, COPS uses the model's logits to estimate the sampling ratio. This sampling ratio is closely associated with model uncertainty and can be applied effectively to deep learning tasks. Furthermore, drawing inspiration from previous work, we address the sensitivity of the method to model misspecification by down-weighting low-density samples. To assess the effectiveness of our proposed method, we conducted extensive experiments using deep neural networks on benchmark datasets. The results show that COPS consistently outperforms baseline methods.
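To make the idea of a logit-based, uncertainty-driven sampling ratio concrete, the following is a minimal sketch in plain Python. It is not the COPS formula, which is not spelled out above. The uncertainty score $\sum_k p_k (1 - p_k)$, the inverse-probability weights, and all function names (`softmax`, `uncertainty`, `sampling_probs`, `subsample`) are illustrative assumptions. Because the score depends only on the logits and not on the labels, the same sketch would cover the input-only (active learning) setting.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def uncertainty(probs):
    """Illustrative uncertainty score: sum_k p_k * (1 - p_k).

    Assumption: this is a stand-in score, NOT the COPS sampling ratio.
    """
    return sum(p * (1.0 - p) for p in probs)


def sampling_probs(all_logits, floor=1e-12):
    """Normalize per-sample uncertainty scores into a sampling distribution."""
    # The floor keeps every sample reachable, so the weights below stay finite.
    scores = [max(uncertainty(softmax(z)), floor) for z in all_logits]
    total = sum(scores)
    return [s / total for s in scores]


def subsample(all_logits, budget, seed=0):
    """Draw `budget` indices with replacement, plus importance weights.

    Assumption: weights 1 / (n * pi_i) are the standard inverse-probability
    correction that keeps the weighted subsample loss unbiased for the
    full-data average loss.
    """
    rng = random.Random(seed)
    probs = sampling_probs(all_logits)
    n = len(probs)
    idx = rng.choices(range(n), weights=probs, k=budget)
    weights = [1.0 / (n * probs[i]) for i in idx]
    return idx, weights


# Three hypothetical samples: two confident predictions, one near-uniform.
logits = [[4.0, 0.0, 0.0], [0.1, 0.0, -0.1], [0.0, 3.0, 0.0]]
pi = sampling_probs(logits)
idx, w = subsample(logits, budget=2)
```

In this toy example the near-uniform sample receives the largest sampling probability, which matches the intuition that uncertain samples are the most informative ones to keep or to label.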