CLIP proved that aligning visual and language spaces is key to solving many vision tasks without explicit training, but it required training image and text encoders from scratch on a huge dataset. LiT improved on this by training only the text encoder and using a pre-trained vision network. In this paper, we show that a common space can be created without any training at all, using single-domain encoders (trained with or without supervision) and a much smaller number of image-text pairs. Furthermore, our model has unique properties. Most notably, deploying a new version with updated training samples can be done in a matter of seconds. Additionally, the representations in the common space are easily interpretable, as every dimension corresponds to the similarity of the input to a unique image-text pair in the multimodal dataset. Experiments on standard zero-shot visual benchmarks demonstrate the typical transfer ability of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multimodal models, raising important questions on their data efficiency and on the role of retrieval in machine learning.
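The idea stated in the abstract, that every dimension of the common space is the similarity of the input to one image-text pair, can be illustrated with a minimal sketch. It rests on assumptions not given in the text: embeddings are precomputed by two frozen single-domain encoders, similarity is cosine similarity, and all function names (`relative_representation`, `zero_shot_classify`, `add_pairs`) are hypothetical. It is not the paper's implementation and omits any refinements the full method may apply.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean norm (eps guards against zero rows)."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def relative_representation(queries, anchors):
    """Cosine similarity of every query to every anchor.

    queries: (n_queries, d) embeddings from one single-domain encoder.
    anchors: (n_pairs, d) embeddings of the paired dataset, same encoder.
    Returns (n_queries, n_pairs); column j is the similarity to pair j,
    which is what makes each dimension directly interpretable.
    """
    return l2_normalize(queries) @ l2_normalize(anchors).T


def zero_shot_classify(image_emb, label_text_emb, anchor_img_emb, anchor_txt_emb):
    """Pick, for each image, the label text closest in the common space.

    Images are compared to anchor images and label texts to anchor texts,
    so the two encoders never need to share an embedding space.
    Assumption: cosine similarity is also used in the common space.
    """
    img_rel = relative_representation(image_emb, anchor_img_emb)
    txt_rel = relative_representation(label_text_emb, anchor_txt_emb)
    scores = l2_normalize(img_rel) @ l2_normalize(txt_rel).T
    return scores.argmax(axis=1), scores


def add_pairs(anchor_img_emb, anchor_txt_emb, new_img_emb, new_txt_emb):
    """Update the model by appending newly embedded image-text pairs.

    No weights are trained; each new pair adds one dimension to the
    common space.
    """
    return (
        np.vstack([anchor_img_emb, new_img_emb]),
        np.vstack([anchor_txt_emb, new_txt_emb]),
    )
```

In this sketch, updating the model amounts to calling `add_pairs` with the embeddings of the new pairs, which is consistent with the claim that a new version can be deployed in seconds. Inspecting the largest entries of a row returned by `relative_representation` shows which image-text pairs the input most resembles.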