Large web-sourced multimodal datasets have powered a slew of new methods for learning general-purpose visual representations, advancing the state of the art in computer vision and revolutionizing zero- and few-shot recognition. One crucial decision facing practitioners is how, if at all, to curate these ever-larger datasets. For example, the creators of the LAION-5B dataset chose to retain only image-caption pairs whose CLIP similarity score exceeded a designated threshold. In this paper, we propose a new state-of-the-art data filtering approach motivated by our observation that nearly 40% of LAION's images contain text that overlaps significantly with the caption. Intuitively, such data could be wasteful as it incentivizes models to perform optical character recognition rather than learning visual features. However, naively removing all such data could also be wasteful, as it throws away images that contain visual features (in addition to overlapping text). Our simple and scalable approach, T-MARS (Text Masking and Re-Scoring), filters out only those pairs where the text dominates the remaining visual features -- by first masking out the text and then filtering out those with a low CLIP similarity score of the masked image. Experimentally, T-MARS outperforms the top-ranked method on the "medium scale" of DataComp (a data filtering benchmark) by a margin of 6.5% on ImageNet and 4.7% on VTAB. Additionally, our systematic evaluation on various data pool sizes from 2M to 64M shows that the accuracy gains enjoyed by T-MARS linearly increase as data and compute are scaled exponentially. Code is available at https://github.com/locuslab/T-MARS.
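The filtering procedure described above can be sketched as follows. This is a minimal illustration of the control flow only, assuming hypothetical stand-ins: the real pipeline uses a text-detection model to find text regions and a CLIP model to score image-caption similarity, whereas here `detect_text` and `similarity` are caller-supplied placeholder functions and images are plain 2D lists of floats.

```python
def mask_text(image, boxes, fill=0.5):
    """Return a copy of `image` (a 2D list of pixel floats) with each
    detected text bounding box (x0, y0, x1, y1) overwritten by `fill`.
    A hypothetical stand-in for masking via an OCR/text detector."""
    masked = [row[:] for row in image]
    for x0, y0, x1, y1 in boxes:
        for y in range(y0, y1):
            for x in range(x0, x1):
                masked[y][x] = fill
    return masked


def t_mars_filter(pairs, detect_text, similarity, threshold):
    """Keep only (image, caption) pairs whose *text-masked* image still
    scores above `threshold` against its caption -- i.e., pairs where
    visual features, not rendered text, carry the alignment.

    `detect_text(image)` -> list of text bounding boxes (placeholder).
    `similarity(image, caption)` -> alignment score (placeholder for CLIP).
    """
    kept = []
    for image, caption in pairs:
        masked = mask_text(image, detect_text(image))
        if similarity(masked, caption) >= threshold:
            kept.append((image, caption))
    return kept
```

In the actual method, the threshold plays the same role as LAION's CLIP-score cutoff, but it is applied to the masked image, so pairs whose high score came only from overlapping rendered text fall below it and are dropped.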