We demonstrate text as a strong cross-modal interface. Rather than relying on deep embeddings as the interface representation that connects image and language, our approach represents an image as text, which brings the interpretability and flexibility inherent to natural language. We employ an autoencoder that uses a pre-trained text-to-image diffusion model for decoding. The encoder is trained to transform an input image into text, which is then fed into the fixed text-to-image diffusion decoder to reconstruct the original input, a process we term De-Diffusion. Experiments validate both the precision and the comprehensiveness of De-Diffusion text as a representation of images, such that it can be readily ingested by off-the-shelf text-to-image tools and LLMs for diverse multi-modal tasks. For example, a single De-Diffusion model generalizes to provide transferable prompts for different text-to-image tools, and also achieves a new state of the art on open-ended vision-language tasks by simply prompting large language models with few-shot examples.
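The autoencoder structure described above (a trainable image-to-text encoder, a text bottleneck, and a fixed text-to-image decoder scored by reconstruction error) can be illustrated with a toy sketch. Everything in it is a hypothetical stand-in rather than the actual De-Diffusion implementation: the eight-word `VOCAB`, the linear scoring encoder, the lookup-table "decoder" that replaces the pre-trained diffusion model, and the gradient-free hill climbing that replaces real training are all assumptions chosen only to keep the example self-contained.

```python
import copy
import random

# Hypothetical toy vocabulary; a real system would use a text tokenizer's vocabulary.
VOCAB = ["red", "blue", "green", "dark", "bright", "small", "large", "round"]
NUM_SLOTS = 3   # number of words in the text bottleneck
IMAGE_DIM = 6   # a toy "image" is a flat list of 6 floats


def init_encoder(rng):
    # encoder[slot][word] is a weight vector that is scored against the image.
    return [[[rng.uniform(-1, 1) for _ in range(IMAGE_DIM)]
             for _ in VOCAB] for _ in range(NUM_SLOTS)]


def init_frozen_decoder(rng):
    # Stand-in for a pre-trained text-to-image model: each word maps to a
    # fixed image-space vector. It is never updated during training.
    return {w: [rng.uniform(-1, 1) for _ in range(IMAGE_DIM)] for w in VOCAB}


def encode(image, encoder):
    # Image -> text: pick the highest-scoring word for every slot.
    words = []
    for slot in encoder:
        scores = [sum(w * x for w, x in zip(vec, image)) for vec in slot]
        words.append(VOCAB[scores.index(max(scores))])
    return words


def decode(words, decoder):
    # Text -> image: average the fixed per-word vectors.
    return [sum(decoder[w][i] for w in words) / len(words)
            for i in range(IMAGE_DIM)]


def loss(images, encoder, decoder):
    # Mean squared reconstruction error through the text bottleneck.
    total = 0.0
    for img in images:
        recon = decode(encode(img, encoder), decoder)
        total += sum((a - b) ** 2 for a, b in zip(img, recon)) / IMAGE_DIM
    return total / len(images)


def train_encoder(images, encoder, decoder, steps, rng, step_size=0.3):
    # Gradient-free hill climbing on the encoder only (an assumption made
    # for simplicity): perturb one weight vector and keep the change if the
    # reconstruction loss does not get worse.
    best = loss(images, encoder, decoder)
    for _ in range(steps):
        candidate = copy.deepcopy(encoder)
        s = rng.randrange(NUM_SLOTS)
        v = rng.randrange(len(VOCAB))
        candidate[s][v] = [w + rng.gauss(0, step_size) for w in candidate[s][v]]
        cand_loss = loss(images, candidate, decoder)
        if cand_loss <= best:
            encoder, best = candidate, cand_loss
    return encoder, best


rng = random.Random(0)
images = [[rng.uniform(-1, 1) for _ in range(IMAGE_DIM)] for _ in range(8)]
encoder = init_encoder(rng)
decoder = init_frozen_decoder(rng)
decoder_before = copy.deepcopy(decoder)

initial_loss = loss(images, encoder, decoder)
encoder, final_loss = train_encoder(images, encoder, decoder, steps=200, rng=rng)
caption = " ".join(encode(images[0], encoder))
print("text for image 0:", caption)
print("loss before:", initial_loss, "after:", final_loss)
```

The point of the sketch is the division of labor: only the encoder changes, the decoder stays fixed, and the intermediate representation is a readable string that any text-consuming tool could accept. How the discrete text bottleneck is actually optimized is not specified in this abstract, so the hill-climbing loop should be read purely as a placeholder.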