Currently, data and model size dominate the narrative in the training of super-large, powerful models, while the effect of other attributes of the training dataset on model performance remains largely unexplored. We hypothesize that dataset diversity impacts the performance of vision models. Our study shows positive correlations between test-set accuracy and data diversity, providing an argument for furthering research into dataset attributes beyond size. We analyzed pre-training and model-agnostic meta-learning (MAML) methods on twelve popular visual datasets (e.g., Omniglot, CIFAR-FS, Aircraft) and five model configurations, including MAML variants with different numbers of inner gradient steps and supervised learning. We find moderate to strong positive correlations (R-squared: 0.15-0.42) between accuracy and data diversity, and weaker but still significant correlations (R-squared: ~0.2) between loss and diversity. These findings support our hypothesis and open a promising path toward a deeper exploration of how formal data diversity influences model performance. This initial study highlights the potential of (Task2Vec) data diversity as a valuable measure in the rapidly evolving field of large-scale learning, and emphasizes that understanding the dataset is key to building more powerful and generalizable models.
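To make the reported correlation concrete, the kind of analysis described above can be sketched as computing the coefficient of determination (R-squared) of a simple linear fit between per-dataset diversity scores and test accuracies. This is a minimal illustration, not the paper's pipeline: the `diversity` and `accuracy` values below are hypothetical placeholders, not the study's actual measurements, and real Task2Vec diversity coefficients would come from embedding tasks with a probe network.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a least-squares linear fit of ys on xs.

    For a simple linear regression this equals the squared Pearson
    correlation between xs and ys.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)          # variance term of x
    syy = sum((y - my) ** 2 for y in ys)          # variance term of y
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # covariance term
    return sxy ** 2 / (sxx * syy)

# Hypothetical per-dataset numbers (NOT the paper's data): Task2Vec-style
# diversity coefficients paired with few-shot test accuracies.
diversity = [0.10, 0.15, 0.22, 0.28, 0.35, 0.41]
accuracy = [0.52, 0.55, 0.61, 0.60, 0.68, 0.71]

print(f"R-squared: {r_squared(diversity, accuracy):.3f}")
```

A value near 1 would indicate that diversity explains most of the variance in accuracy; the moderate values reported in the abstract (0.15-0.42) suggest diversity is one meaningful factor among several.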