We introduce SuperClass, a super simple classification method for vision-language pre-training on image-text data. Unlike its contrastive counterpart CLIP, which contrasts image embeddings against those produced by a text encoder, SuperClass directly uses tokenized raw text as supervised classification labels, without any additional text filtering or selection. Because it does not encode text as a contrastive target, SuperClass requires neither a text encoder nor the large batch sizes that CLIP depends on. SuperClass demonstrates superior performance on a range of downstream tasks, including classic computer vision benchmarks and vision-language tasks. We further explore the scaling behavior of SuperClass with respect to model size, training length, and data size, and report encouraging results and comparisons to CLIP. https://github.com/x-cls/superclass
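The core idea, using a caption's own tokens as classification targets, can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the toy vocabulary, whitespace tokenizer, and the choice of softmax cross-entropy over a normalized multi-hot target are all assumptions (SuperClass uses a standard subword tokenizer over the full vocabulary, and its exact loss and label weighting may differ).

```python
import numpy as np

# Hypothetical toy vocabulary standing in for a real subword tokenizer's vocab.
VOCAB = {"a": 0, "photo": 1, "of": 2, "cat": 3, "dog": 4}


def caption_to_multihot(caption: str) -> np.ndarray:
    """Turn a raw caption into a multi-hot target over the vocabulary.

    Each token that appears in the caption becomes a positive class;
    no text encoder is involved.
    """
    target = np.zeros(len(VOCAB))
    for tok in caption.lower().split():  # toy whitespace "tokenizer"
        if tok in VOCAB:
            target[VOCAB[tok]] = 1.0
    return target


def classification_loss(logits: np.ndarray, target: np.ndarray) -> float:
    """Softmax cross-entropy against the normalized multi-hot target.

    One plausible reading of classification-as-supervision: the multi-hot
    vector is normalized into a distribution over caption tokens.
    """
    target = target / target.sum()
    m = logits.max()
    log_probs = logits - (m + np.log(np.exp(logits - m).sum()))  # stable log-softmax
    return float(-(target * log_probs).sum())
```

With uniform logits over a 5-word vocabulary, the loss is log(5) regardless of which tokens are positive, since the normalized target is a distribution; training then pushes probability mass onto the caption's tokens.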