Recent progress in scaling up large language models has shown impressive capabilities in performing few-shot learning across a wide range of text-based tasks. However, a key limitation is that these language models fundamentally lack visual perception - a crucial attribute needed to extend these models to be able to interact with the real world and solve vision tasks, such as in visual-question answering and robotics. Prior works have largely connected image to text through pretraining and/or fine-tuning on curated image-text datasets, which can be a costly and expensive process. In order to resolve this limitation, we propose a simple yet effective approach called Language-Quantized AutoEncoder (LQAE), a modification of VQ-VAE that learns to align text-image data in an unsupervised manner by leveraging pretrained language models (e.g., BERT, RoBERTa). Our main idea is to encode image as sequences of text tokens by directly quantizing image embeddings using a pretrained language codebook. We then apply random masking followed by a BERT model, and have the decoder reconstruct the original image from BERT predicted text token embeddings. By doing so, LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs. This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features. To the best of our knowledge, our work is the first work that uses unaligned images for multimodal tasks by leveraging the power of pretrained language models.
翻译:近年来,大规模语言模型的扩展在各类基于文本的任务中展现了卓越的小样本学习能力。然而,一个关键局限在于这些语言模型从根本上缺乏视觉感知能力——这是扩展模型以使其能够与现实世界交互并解决视觉任务(如视觉问答和机器人技术)所必备的属性。先前的工作主要通过预训练和/或在精心整理的图像-文本数据集上进行微调来连接图像与文本,这一过程代价高昂且成本高企。为解决这一局限,我们提出一种简单而有效的方法——语言量化自编码器(LQAE),它是VQ-VAE的改进版本,通过利用预训练语言模型(如BERT、RoBERTa)以无监督方式学习对齐文本-图像数据。我们的核心思路是:直接使用预训练语言编码本对图像嵌入进行量化,将图像编码为文本标记序列。随后对编码结果施加随机掩码,并通过BERT模型处理,让解码器从BERT预测的文本标记嵌入中重建原始图像。通过这种方式,LQAE学会用相似的文本标记簇表征相似图像,从而无需使用对齐的文本-图像对即可实现两种模态的对齐。这使得大规模语言模型(如GPT-3)能够进行小样本图像分类,以及基于BERT文本特征的图像线性分类。据我们所知,本工作是首次利用预训练语言模型的能力,通过非对齐图像完成多模态任务的研究。