Large-scale visual language models are widely used as pre-trained models and then adapted for various downstream tasks. While humans are known to learn new tasks efficiently from a few examples, deep learning models struggle to adapt from few examples. In this work, we investigate task adaptation in the low-data regime and provide a thorough study of existing adaptation methods for generative visual language models. We also show important benefits of self-labelling, i.e. using the model's own predictions to self-improve when a larger number of unlabelled images from the same distribution is available. Our study demonstrates significant gains from our proposed task adaptation pipeline across a wide range of visual language tasks, such as visual classification (ImageNet), visual captioning (COCO), detailed visual captioning (Localised Narratives) and visual question answering (VQAv2).