Data point selection (DPS) is becoming a critical topic in deep learning due to the ease of acquiring uncurated training data compared to the difficulty of obtaining curated or processed data. Existing approaches to DPS are predominantly based on a bi-level optimisation (BLO) formulation, which is demanding in terms of memory and computation and exhibits theoretical defects when used with minibatches. We therefore propose a novel Bayesian approach to DPS. We view DPS as posterior inference in a novel Bayesian model in which the posterior distributions of the instance-wise weights and the main neural network parameters are inferred under a reasonable prior and likelihood model. We employ stochastic gradient Langevin MCMC sampling to learn the main network and instance-wise weights jointly, ensuring convergence even with minibatches. Our update equation is comparable in cost to widely used SGD and far more efficient than existing BLO-based methods. Through controlled experiments in both the vision and language domains, we present a proof of concept. We further demonstrate that our method scales to large language models and enables automated per-task optimisation for instruction fine-tuning datasets.
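The core sampler referenced above is stochastic gradient Langevin dynamics: an SGD-like step on the (minibatch) gradient of the negative log-posterior, plus injected Gaussian noise whose variance matches the step size, so the iterates sample from the posterior rather than converge to a point estimate. A minimal sketch on a toy one-dimensional target (a standard normal posterior) illustrates the update; the function name `sgld_step` and the toy target are illustrative, not the paper's actual model, where the gradient would cover both network parameters and instance-wise weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgld_step(theta, grad, step_size, rng):
    # One SGLD update: a half-step of gradient descent on the negative
    # log-posterior, plus Gaussian noise with variance equal to the step
    # size, so iterates explore the posterior instead of collapsing.
    noise = rng.normal(0.0, np.sqrt(step_size), size=theta.shape)
    return theta - 0.5 * step_size * grad + noise

# Toy target: posterior proportional to exp(-0.5 * theta^2), i.e. a
# standard normal; the gradient of its negative log-density is theta.
theta = np.array([5.0])
samples = []
for t in range(20000):
    grad = theta                     # would be a minibatch gradient in practice
    theta = sgld_step(theta, grad, step_size=0.01, rng=rng)
    if t >= 2000:                    # discard burn-in
        samples.append(theta[0])

mean, var = np.mean(samples), np.var(samples)
```

After burn-in, the empirical mean and variance of the iterates should be close to 0 and 1, the moments of the target; this per-step cost (one gradient evaluation plus one noise draw) is what makes the update comparable to plain SGD.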