Pre-training datasets are critical for building state-of-the-art machine learning models, which motivates rigorous study of their impact on downstream tasks. In this work, we study the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset. Empirically, we find that, with the size of the pre-training dataset fixed, the best downstream performance comes from a balance between intra-class and inter-class diversity. To understand the underlying mechanism, we show theoretically that downstream performance depends monotonically on both types of diversity. Notably, our theory reveals that the optimal class-to-sample ratio (#classes / #samples per class) is invariant to the size of the pre-training dataset, which motivates an application: predicting the optimal number of pre-training classes. We demonstrate the effectiveness of this application with an improvement of around 2 points on downstream tasks when using ImageNet as the pre-training dataset.
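The invariance claim can be illustrated with a little arithmetic. If the dataset size is N = K × n (K classes, n samples per class) and the optimal ratio r = K / n does not change with N, then K = sqrt(r × N) and n = sqrt(N / r). The Python sketch below applies this relation; the function name `optimal_split` and the numeric values are hypothetical illustrations, not taken from the paper, and how r would be estimated in practice (for example, from a small-scale sweep) is an assumption here.

```python
import math
from typing import Tuple


def optimal_split(total_samples: int, ratio: float) -> Tuple[int, int]:
    """Split a sample budget into (num_classes, samples_per_class).

    Assumes total_samples = num_classes * samples_per_class and that
    ratio = num_classes / samples_per_class is known and size-invariant.
    """
    if total_samples <= 0 or ratio <= 0:
        raise ValueError("total_samples and ratio must be positive")
    # K = sqrt(r * N), rounded to the nearest whole class count
    num_classes = max(1, round(math.sqrt(ratio * total_samples)))
    # n = N / K, rounded down so the budget is never exceeded
    samples_per_class = max(1, total_samples // num_classes)
    return num_classes, samples_per_class


# Hypothetical ratio of 0.25 (one class for every four samples per class),
# imagined as estimated on a small budget and reused on a larger one.
small = optimal_split(10_000, 0.25)
large = optimal_split(1_000_000, 0.25)
print(small, large)
```

Under these assumptions, growing the budget by a factor of 100 grows both the class count and the samples per class by a factor of 10, which keeps their ratio fixed.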