We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks few-shot abilities for open-ended classification with small visual language models. Our approach imitates image captions in a self-supervised way by clustering a large pool of images and then assigning semantically unrelated names to the clusters. From these we construct a training signal consisting of an interleaved sequence of image and pseudo-caption pairs followed by a query image, which we denote as the 'self-context' sequence. Based on this signal, the model is trained to produce the correct pseudo-caption for the query image. We demonstrate the performance and flexibility of SeCAt on several multimodal few-shot datasets spanning various granularities. Using models with approximately 1B parameters, we outperform the few-shot abilities of much larger models, such as Frozen and FROMAGe. SeCAt opens new possibilities for research and applications in open-ended few-shot learning, which would otherwise require access to large or proprietary models.
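The construction of a self-context sequence can be illustrated with a minimal sketch. It is not the authors' implementation: the clustering algorithm (plain k-means over precomputed image features), the function names `kmeans` and `build_self_context`, the parameters `n_ways` and `k_shots`, and the list of pseudo-names are all assumptions made for illustration. Images are represented only by their row index in a feature matrix.

```python
import random
import numpy as np


def kmeans(features, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in for the clustering step).

    Returns one cluster id per row of `features`.
    """
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct data points (fancy indexing copies).
    centers = features[rng.choice(len(features), size=k, replace=False)]
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        # Squared Euclidean distance from every point to every center: (n, k).
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = features[labels == c]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return labels


def build_self_context(labels, names, n_ways=2, k_shots=1, seed=0):
    """Build one hypothetical self-context episode.

    Returns (context, query, target) where `context` is a shuffled list of
    (image_index, pseudo_caption) pairs, `query` is a held-out image index,
    and `target` is the pseudo-caption the model should produce for it.
    """
    rng = random.Random(seed)
    clusters = {}
    for idx, c in enumerate(labels):
        clusters.setdefault(int(c), []).append(idx)
    # Each sampled cluster must supply k_shots support images plus a spare
    # image that can serve as the query.
    eligible = [c for c, members in clusters.items() if len(members) > k_shots]
    if len(eligible) < n_ways or len(names) < n_ways:
        raise ValueError("not enough clusters or names for this episode")
    chosen = rng.sample(eligible, n_ways)
    # Names are drawn at random, so they carry no semantic link to a cluster.
    pseudo = dict(zip(chosen, rng.sample(names, n_ways)))
    query_cluster = rng.choice(chosen)
    context, query = [], None
    for c in chosen:
        picked = rng.sample(clusters[c], k_shots + 1)
        context += [(i, pseudo[c]) for i in picked[:k_shots]]
        if c == query_cluster:
            query = picked[k_shots]  # held out, never shown in the context
    rng.shuffle(context)
    return context, query, pseudo[query_cluster]
```

In this sketch, one would first call `kmeans` on features from any image encoder, then call `build_self_context` repeatedly with different seeds to generate training episodes. Because names are re-drawn per episode, the model cannot memorise a fixed name for a cluster and has to read the answer from the context.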