The proliferation of large-scale datasets poses a major computational challenge for model training. Traditional data subsampling operates as a static, task-independent preprocessing step that often discards information critical to downstream prediction. In this paper, we introduce the Adversarial Soft-Selection Subsampling (ASSS) framework, a novel paradigm that recasts data reduction as a differentiable, end-to-end learning problem. ASSS sets up an adversarial game between a selector network and a task network, in which the selector learns to assign continuous importance weights to samples. Direct optimization through a Gumbel-Softmax relaxation allows the selector to identify and retain the samples most informative for the task objective, guided by a loss function that balances predictive fidelity against sparsity. A theoretical analysis connects the framework to the information bottleneck principle. Comprehensive experiments on four large-scale real-world datasets show that ASSS consistently outperforms heuristic subsampling baselines such as clustering and nearest-neighbor thinning in preserving model performance. Notably, ASSS not only matches but sometimes exceeds training on the full dataset, an effect we attribute to intelligent denoising. This work establishes task-aware data subsampling as a learnable component, offering a principled approach to efficient learning on large-scale data.
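As a minimal sketch of the mechanism described above, the code below illustrates soft selection via a two-class (keep/drop) Gumbel-Softmax with straight-through sampling, and a joint objective trading off weighted fidelity against a sparsity penalty. For brevity the adversarial coupling is folded into a single loss; the names (`Selector`, `asss_step`) and hyperparameters (`tau`, `lam`) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Selector(nn.Module):
    """Assigns a continuous importance weight in [0, 1] to each sample."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # logits for {drop, keep}
        )

    def forward(self, x, tau=0.5):
        logits = self.net(x)
        # Gumbel-Softmax relaxation: a differentiable surrogate for
        # discrete keep/drop sampling (hard=True uses straight-through
        # gradients, so the forward pass is a hard 0/1 selection).
        y = F.gumbel_softmax(logits, tau=tau, hard=True)
        return y[:, 1]  # mass on the "keep" class

def asss_step(selector, task_model, x, y, lam=1e-2, tau=0.5):
    """One joint step: selection-weighted task loss (fidelity) plus a
    sparsity penalty on the expected fraction of retained samples."""
    w = selector(x, tau)                               # soft weights
    per_sample = F.cross_entropy(task_model(x), y, reduction="none")
    fidelity = (w * per_sample).sum() / w.sum().clamp(min=1e-8)
    sparsity = w.mean()                                # keep-rate
    return fidelity + lam * sparsity
```

In this sketch, increasing `lam` pressures the selector toward smaller retained subsets, while the weighted fidelity term pushes it to keep the samples the task network still needs, mirroring the fidelity/sparsity balance the abstract describes.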