Large-scale multimodal representation learning successfully optimizes for zero-shot transfer at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of image-text data) does not explicitly encourage representations to support few-shot adaptation. In this work, we propose a simple but carefully designed extension to multimodal pretraining that enables representations to accommodate additional context. Using this objective, we show that vision-language models can be trained to exhibit significantly increased few-shot adaptation: across 21 downstream tasks, we find up to four-fold improvements in test-time sample efficiency, and average few-shot adaptation gains of over 5%, while retaining zero-shot generalization performance across model scales and training durations. In particular, equipped with simple, training-free, metric-based adaptation mechanisms, our representations easily surpass more complex and expensive optimization-based schemes, vastly simplifying generalization to new domains.
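The abstract does not spell out the training-free, metric-based adaptation mechanism it refers to; a common instance of this family is a nearest-class-prototype classifier over frozen encoder embeddings. The sketch below illustrates that idea only, under the assumption of unit-normalized image features; the function names and the toy data are illustrative, not part of the paper's method.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Normalize embeddings to unit length, as is standard for contrastive models.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def build_prototypes(support_embeds, support_labels, num_classes):
    # Average the normalized support embeddings of each class into one prototype per class.
    protos = np.stack([
        support_embeds[support_labels == c].mean(axis=0) for c in range(num_classes)
    ])
    return l2_normalize(protos)

def classify(query_embeds, prototypes):
    # Cosine similarity between queries and class prototypes; predict the nearest prototype.
    sims = l2_normalize(query_embeds) @ prototypes.T
    return sims.argmax(axis=1)

# Toy usage: random features stand in for frozen image-encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
num_classes, shots, dim = 5, 4, 64
support = rng.normal(size=(num_classes * shots, dim))
labels = np.repeat(np.arange(num_classes), shots)
queries = rng.normal(size=(10, dim))
prototypes = build_prototypes(l2_normalize(support), labels, num_classes)
preds = classify(queries, prototypes)
```

Because such a scheme involves no gradient updates at test time, its cost scales only with the number of support examples, which is the sense in which metric-based adaptation is cheaper than optimization-based alternatives.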