Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment

Recent progress in scaling up large language models has yielded impressive few-shot learning across a wide range of text-based tasks. However, a key limitation is that these language models fundamentally lack visual perception, a crucial attribute needed to extend them to interact with the real world and solve vision tasks such as visual question answering and robotics. Prior works have largely connected images to text through pretraining and/or fine-tuning on curated image-text datasets, which can be a costly process. To address this limitation, we propose a simple yet effective approach called Language-Quantized AutoEncoder (LQAE), a modification of VQ-VAE that learns to align text-image data in an unsupervised manner by leveraging pretrained language models (e.g., BERT, RoBERTa). Our main idea is to encode an image as a sequence of text tokens by directly quantizing image embeddings using a pretrained language codebook. We then apply random masking followed by a BERT model, and have the decoder reconstruct the original image from the text token embeddings predicted by BERT. By doing so, LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning the two modalities without the use of aligned text-image pairs. This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features. To the best of our knowledge, ours is the first work that uses unaligned images for multimodal tasks by leveraging the power of pretrained language models.
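
The quantization and masking steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tiny `codebook` stands in for a frozen pretrained language-model embedding table, `patch_embeddings` stands in for image-encoder outputs, and the names `quantize`, `random_mask`, and `MASK_ID` are hypothetical. The BERT denoising step and the image decoder are omitted.

```python
import random


def quantize(embeddings, codebook):
    """Map each embedding to the index of its nearest codebook row (squared L2)."""
    ids = []
    for vec in embeddings:
        dists = [sum((a - b) ** 2 for a, b in zip(vec, code)) for code in codebook]
        ids.append(min(range(len(codebook)), key=dists.__getitem__))
    return ids


def random_mask(token_ids, mask_id, mask_prob, rng):
    """Replace each token with mask_id independently with probability mask_prob."""
    return [mask_id if rng.random() < mask_prob else t for t in token_ids]


# Toy stand-in for a frozen language-model embedding table (hypothetical values).
codebook = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
MASK_ID = len(codebook)  # hypothetical id reserved for the mask token

# Toy stand-in for image-encoder outputs, one vector per image patch.
patch_embeddings = [[0.9, 0.1], [0.1, 0.8], [0.95, 1.05]]

# Each patch becomes the id of its nearest "text token" embedding.
token_ids = quantize(patch_embeddings, codebook)

# Randomly hide some tokens; a masked language model would then predict them.
masked_ids = random_mask(token_ids, MASK_ID, mask_prob=0.5, rng=random.Random(0))
```

Because the codebook is held fixed in this sketch, only the mapping from image patches to existing token embeddings varies, which mirrors the idea that similar images should land on similar groups of text tokens.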
