Large web-crawled datasets have played an important role in learning multimodal features with strong generalization capabilities. However, few studies have investigated the details of data design or how to improve it. The recently introduced DataComp challenge asks participants to find the best training data for fixed models. This paper presents our solution to both the filtering track and the BYOD track of the DataComp challenge. Our solution uses the large multimodal models CLIP and BLIP-2 to filter and modify web-crawled data, and utilizes external datasets along with a bag of tricks to improve data quality. Experiments show that our solution significantly outperforms the DataComp baselines (filtering track: 6.6% improvement; BYOD track: 48.5% improvement).
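To make the filtering idea concrete, the following is a minimal sketch of similarity-based filtering of image-text pairs. It is not the pipeline described in this paper. It assumes that image and text embeddings have already been computed by a model such as CLIP, and the names `cosine`, `filter_by_similarity`, and `keep_fraction`, as well as the default value of 0.3, are hypothetical choices made only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_by_similarity(samples, keep_fraction=0.3):
    """Keep the ids of the best-aligned image-text pairs.

    samples: list of (uid, image_embedding, text_embedding) tuples, where the
             embeddings are assumed to come from a CLIP-like model.
    keep_fraction: hypothetical fraction of the pool to retain (illustrative default).
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not samples:
        return []
    # Score every pair, then sort from most to least similar.
    scored = [(cosine(img, txt), uid) for uid, img, txt in samples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    n_keep = max(1, math.ceil(keep_fraction * len(scored)))
    return [uid for _, uid in scored[:n_keep]]


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real model outputs.
    pool = [
        ("a", [1.0, 0.0], [1.0, 0.0]),
        ("b", [1.0, 0.0], [0.0, 1.0]),
        ("c", [1.0, 0.0], [1.0, 1.0]),
        ("d", [1.0, 0.0], [-1.0, 0.0]),
    ]
    print(filter_by_similarity(pool, keep_fraction=0.5))
```

Ranking by score and keeping a fixed fraction, rather than applying an absolute threshold, keeps the size of the retained subset predictable across pools with different score distributions. Either variant would fit the same sketch.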