With the development of Vision-Language Pre-training Models (VLPMs) such as CLIP and ALIGN, association-based visual tasks such as image classification and image-text retrieval have seen significant breakthroughs, driven by the zero-shot capability of CLIP without fine-tuning. However, CLIP is hard to apply to generation-based tasks because it lacks both a decoder architecture and pre-training tasks for generation. Although previous works have given CLIP generation capacity through additional language models, a modality gap exists between the CLIP representations of different modalities, and CLIP cannot model the offset of this gap, so concepts fail to transfer across modalities. To solve this problem, we map images/videos to the language modality and generate captions from the language modality. In this paper, we propose the K-nearest-neighbor Cross-modality Mapping (Knight), a zero-shot method from association to generation. With text-only unsupervised training, Knight achieves state-of-the-art performance among zero-shot methods for image captioning and video captioning. Our code is available at https://github.com/junyangwang0410/Knight.
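The idea of mapping an image into the language modality through its nearest neighbors can be illustrated with a minimal sketch. The abstract does not give the details of the mapping, so everything below is an assumption made for illustration: the function name `knn_map_to_text_space`, the softmax weighting over cosine similarities, the `temperature` parameter, and the toy 2-D vectors standing in for CLIP embeddings. The text decoder that would turn the mapped vector into a caption is omitted.

```python
import math


def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def knn_map_to_text_space(image_embed, text_embeds, k=3, temperature=0.1):
    """Hypothetical sketch: represent an image by a weighted mix of its
    k most similar text embeddings, so the result lies in the text modality.

    The softmax weighting and temperature are assumptions, not taken from
    the paper.
    """
    if not text_embeds:
        raise ValueError("text_embeds must not be empty")

    q = normalize(image_embed)
    texts = [normalize(t) for t in text_embeds]

    # Cosine similarity between the image and every text embedding.
    sims = [dot(q, t) for t in texts]

    # Indices of the k most similar text embeddings.
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]

    # Numerically stable softmax over the selected similarities.
    max_logit = max(sims[i] / temperature for i in top)
    exps = [math.exp(sims[i] / temperature - max_logit) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted combination of text embeddings only; the image vector itself
    # is never mixed in, so the output stays on the text side of the gap.
    dim = len(q)
    mixed = [sum(w * texts[i][d] for w, i in zip(weights, top)) for d in range(dim)]
    return normalize(mixed), top, weights


# Toy usage with made-up 2-D vectors in place of real CLIP features.
toy_text_embeds = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
toy_image_embed = [0.9, 0.1]
mapped, neighbors, weights = knn_map_to_text_space(
    toy_image_embed, toy_text_embeds, k=2
)
```

Because the output is built solely from text embeddings, a decoder trained only on text could in principle consume it, which is one way to read the text-only training claim in the abstract.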