We propose CLIP-Lite, an information-efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample pair when optimizing its contrastive learning objective. We accomplish this by maximizing an information-efficient lower bound on the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly less data and smaller batch sizes while obtaining better performance than CLIP at the same scale. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains an absolute gain of +14.0% mAP on Pascal VOC classification and a +22.1% top-1 accuracy gain on ImageNet, while being comparable or superior to other, more complex, text-supervised models. CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks. Implementation: https://github.com/4m4n5/CLIP-Lite
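To make the one-negative-per-positive idea concrete, here is a minimal sketch of a mutual information lower bound that needs only a single negative pair per positive pair. The abstract does not name the bound; this sketch assumes a Jensen-Shannon-style estimator (the form popularized by Deep InfoMax), a plain dot-product critic, and precomputed feature arrays. The function names are hypothetical and are not taken from the linked repository.

```python
import numpy as np


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)


def jsd_mi_lower_bound(img_feats, txt_feats, rng):
    """Jensen-Shannon-style MI estimate with one negative caption per image.

    img_feats, txt_feats: arrays of shape (batch, dim); row i of each is a
    matching image-text pair. Requires batch >= 2.
    """
    n = img_feats.shape[0]
    # Assumed critic: dot product between image and text features.
    pos_scores = np.sum(img_feats * txt_feats, axis=1)
    # One negative per positive: shift captions by a random nonzero offset
    # so every image is paired with a caption from a different sample.
    shift = int(rng.integers(1, n))
    neg_txt = np.roll(txt_feats, shift, axis=0)
    neg_scores = np.sum(img_feats * neg_txt, axis=1)
    # Reward high scores on matched pairs, penalize high scores on mismatches.
    return np.mean(-softplus(-pos_scores)) - np.mean(softplus(neg_scores))


def contrastive_loss(img_feats, txt_feats, rng):
    # Maximizing the bound is the same as minimizing its negation.
    return -jsd_mi_lower_bound(img_feats, txt_feats, rng)
```

Because each positive pair is contrasted with exactly one shuffled pair, the estimate does not depend on having many negatives in the batch, which is the property the abstract attributes to CLIP-Lite's objective. In practice the features would come from trainable image and text encoders and the loss would be minimized with an autodiff framework.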