T-MARS: Improving Visual Representations by Circumventing Text Feature Learning

Large web-sourced multimodal datasets have powered a slew of new methods for learning general-purpose visual representations, advancing the state of the art in computer vision and revolutionizing zero- and few-shot recognition. One crucial decision facing practitioners is how, if at all, to curate these ever-larger datasets. For example, the creators of the LAION-5B dataset chose to retain only image-caption pairs whose CLIP similarity score exceeded a designated threshold. In this paper, we propose a new state-of-the-art data filtering approach motivated by our observation that nearly 40% of LAION's images contain text that overlaps significantly with the caption. Intuitively, such data could be wasteful, as it incentivizes models to perform optical character recognition rather than to learn visual features. However, naively removing all such data could also be wasteful, as it throws away images that contain visual features in addition to the overlapping text. Our simple and scalable approach, T-MARS (Text Masking and Re-Scoring), filters out only those pairs where the text dominates the remaining visual features: it first masks out the text in each image and then discards the pairs whose masked image has a low CLIP similarity score with the caption. Experimentally, T-MARS outperforms the top-ranked method on the "medium scale" of DataComp (a data filtering benchmark) by a margin of 6.5% on ImageNet and 4.7% on VTAB. Additionally, our systematic evaluation on data pool sizes from 2M to 64M shows that the accuracy gains enjoyed by T-MARS increase linearly as data and compute are scaled exponentially. Code is available at https://github.com/locuslab/T-MARS.
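The mask-then-re-score idea described above can be illustrated with a minimal sketch. It is not the released T-MARS implementation: the function names (`mask_text`, `cosine`, `filter_by_masked_score`), the toy grayscale image type, the choice to fill text regions with the mean of the non-text pixels, and the fixed `threshold` are all assumptions made for illustration. In practice the text boxes would come from a text detector and the embeddings from a CLIP model; here both are supplied by the caller.

```python
import math
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]        # toy grayscale image: non-empty rows of pixel intensities
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive
Sample = Tuple[Image, str, Sequence[Box]]  # image, caption, detected text boxes


def mask_text(image: Image, boxes: Sequence[Box]) -> Image:
    """Return a copy of `image` with every text box filled in.

    Assumption: the fill value is the mean of all pixels outside the boxes
    (0.0 if the boxes cover the whole image).
    """
    height, width = len(image), len(image[0])
    in_box = [[False] * width for _ in range(height)]
    for top, left, bottom, right in boxes:
        for r in range(max(0, top), min(height, bottom)):
            for c in range(max(0, left), min(width, right)):
                in_box[r][c] = True
    outside = [image[r][c] for r in range(height) for c in range(width)
               if not in_box[r][c]]
    fill = sum(outside) / len(outside) if outside else 0.0
    return [[fill if in_box[r][c] else image[r][c] for c in range(width)]
            for r in range(height)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def filter_by_masked_score(
    samples: Sequence[Sample],
    embed_image: Callable[[Image], Sequence[float]],  # stand-in for a CLIP image encoder
    embed_text: Callable[[str], Sequence[float]],     # stand-in for a CLIP text encoder
    threshold: float,                                 # hypothetical cut-off
) -> List[int]:
    """Return indices of samples whose masked image still matches its caption."""
    kept = []
    for index, (image, caption, boxes) in enumerate(samples):
        masked = mask_text(image, boxes)
        score = cosine(embed_image(masked), embed_text(caption))
        if score >= threshold:
            kept.append(index)
    return kept
```

Under these assumptions, a pair whose only caption-relevant signal lies inside the text boxes scores low once the boxes are filled in and is dropped, while a pair that also has matching visual content outside the boxes is kept. The sketch scores the masked image but returns indices into the original samples, so the retained pairs themselves are left unmodified.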
