Self-supervision and natural language supervision have emerged as two exciting ways to train general-purpose image encoders that excel at a variety of downstream tasks. Recent works such as M3AE and SLIP have suggested that these approaches can be effectively combined, but, most notably, their results use small pre-training datasets (<50M samples) and do not reflect the large-scale regime (>100M samples) that is commonly used for these approaches. Here we investigate whether a similar approach remains effective when trained with a much larger amount of data. We find that a combination of two state-of-the-art approaches, masked auto-encoders (MAE) and contrastive language-image pre-training (CLIP), provides a benefit over CLIP when trained on a corpus of 11.3M image-text pairs, but little to no benefit (as evaluated on a suite of common vision tasks) over CLIP when trained on a large corpus of 1.4B images. Our work provides some much-needed clarity into the effectiveness (or lack thereof) of self-supervision for large-scale image-text training.