Data curation is an essential component of large-scale pretraining. In this work, we demonstrate that jointly selecting batches of data is more effective for learning than selecting examples independently. Multimodal contrastive objectives expose the dependencies between data and thus naturally yield criteria for measuring the joint learnability of a batch. We derive a simple and tractable algorithm for selecting such batches, which significantly accelerates training beyond individually prioritized data points. Since performance improves when selecting from larger super-batches, we also leverage recent advances in model approximation to reduce the associated computational overhead. As a result, our approach, multimodal contrastive learning with joint example selection (JEST), surpasses state-of-the-art models with up to 13$\times$ fewer iterations and 10$\times$ less computation. Essential to the performance of JEST is the ability to steer the data selection process towards the distribution of smaller, well-curated datasets via pretrained reference models, exposing the level of data curation as a new dimension for neural scaling laws.
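To make the selection criterion concrete, the following is a minimal sketch, assuming CLIP-style normalized image/text embeddings and a frozen pretrained reference model: each candidate is scored by its learnability (learner contrastive loss minus reference contrastive loss), and the training batch is built in chunks so that later candidates are scored jointly with the examples already selected, which share in-batch negatives. The function names, the greedy chunk selection, and the brute-force rescoring are illustrative simplifications under these assumptions, not the paper's exact algorithm or released code.

```python
# Minimal sketch of joint example selection (illustrative, not the authors' code).
import numpy as np


def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def contrastive_losses(img_emb, txt_emb, temperature=0.07):
    """Per-example symmetric (image->text and text->image) contrastive losses."""
    logits = img_emb @ txt_emb.T / temperature          # [B, B] similarity matrix
    return -(np.diag(log_softmax(logits, axis=1)) +
             np.diag(log_softmax(logits, axis=0))) / 2


def jest_select(img_l, txt_l, img_r, txt_r, batch_size, n_chunks=4):
    """Pick `batch_size` examples from a super-batch by joint learnability:
    learner loss minus frozen-reference loss, recomputed against the examples
    already selected so that in-batch negatives are shared."""
    chunk = batch_size // n_chunks
    # First chunk: independent learnability over the whole super-batch.
    learnability = contrastive_losses(img_l, txt_l) - contrastive_losses(img_r, txt_r)
    selected = list(np.argsort(learnability)[-chunk:])
    remaining = [i for i in range(len(img_l)) if i not in set(selected)]
    # Remaining chunks: score each candidate jointly with the current selection.
    for _ in range(n_chunks - 1):
        scores = []
        for i in remaining:
            idx = selected + [i]
            scores.append(contrastive_losses(img_l[idx], txt_l[idx])[-1]
                          - contrastive_losses(img_r[idx], txt_r[idx])[-1])
        top = set(np.argsort(scores)[-chunk:])
        selected += [remaining[k] for k in top]
        remaining = [r for k, r in enumerate(remaining) if k not in top]
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    def unit(x):  # l2-normalize rows
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    # Toy super-batch of 256 pairs; learner and reference embeddings differ.
    img_l, txt_l = unit(rng.normal(size=(256, 64))), unit(rng.normal(size=(256, 64)))
    img_r, txt_r = unit(rng.normal(size=(256, 64))), unit(rng.normal(size=(256, 64)))
    batch = jest_select(img_l, txt_l, img_r, txt_r, batch_size=64)
    print(len(batch), "examples selected")
```

The greedy top-score chunks here stand in for the sampling-based chunked procedure and the model-approximation tricks described in the paper; the sketch only conveys the core idea of scoring candidates jointly rather than independently.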