We introduce SuperClass, a super simple classification method for vision-language pre-training on image-text data. Unlike its contrastive counterpart CLIP, which contrasts image features against the output of a text encoder, SuperClass directly uses tokenized raw text as supervised classification labels, without any additional text filtering or selection. Because no text encoding is needed as a contrastive target, SuperClass requires neither a text encoder nor the large batch sizes that CLIP depends on. SuperClass demonstrates superior performance on a range of downstream tasks, including classic computer vision benchmarks and vision-language tasks. We further explore the scaling behavior of SuperClass with respect to model size, training length, and data size, and report encouraging results and comparisons to CLIP. Code: https://github.com/x-cls/superclass
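To make the core idea concrete, the following is a minimal sketch of training with tokenized raw text as multi-label classification targets. All names and sizes here (the vocabulary size, the random features standing in for a vision encoder, the plain BCE loss) are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

VOCAB_SIZE = 1000  # hypothetical subword vocabulary size

def caption_to_multihot(token_ids, vocab_size=VOCAB_SIZE):
    """Turn a tokenized caption into a multi-hot classification target:
    every subword ID that appears in the raw text becomes a positive class."""
    target = np.zeros(vocab_size)
    target[token_ids] = 1.0
    return target

def bce_with_logits(logits, targets):
    """Per-vocabulary-entry binary cross-entropy, the standard multi-label
    classification loss (a stand-in for the paper's exact objective)."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12
    return -np.mean(targets * np.log(probs + eps)
                    + (1 - targets) * np.log(1 - probs + eps))

# Random features stand in for a vision-encoder output; W is a linear
# classification head over the tokenizer vocabulary. No text encoder and
# no in-batch negatives are involved, so batch size is not critical.
rng = np.random.default_rng(0)
feats = rng.normal(size=(2, 512))            # batch of image features
W = rng.normal(size=(512, VOCAB_SIZE)) * 0.01
logits = feats @ W                           # per-token logits
targets = np.stack([caption_to_multihot([3, 17, 256]),   # caption 1 tokens
                    caption_to_multihot([5, 17])])       # caption 2 tokens
loss = bce_with_logits(logits, targets)
```

The key contrast with CLIP is visible in the shapes: the supervision is a fixed-size vector over the tokenizer vocabulary per image, rather than a similarity matrix between image and text embeddings across the batch.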