Our understanding of the visual world is centered around various concept axes, each characterizing a different aspect of visual entities. While different concept axes can be easily specified by language, e.g., color, the exact visual nuances along each axis often exceed what language can articulate, e.g., a particular style of painting. In this work, our goal is to learn a language-informed visual concept representation by simply distilling large pre-trained vision-language models. Specifically, we train a set of concept encoders to encode the information pertinent to a set of language-informed concept axes, with the objective of reproducing the input image through a pre-trained Text-to-Image (T2I) model. To encourage better disentanglement of the different concept encoders, we anchor the concept embeddings to a set of text embeddings obtained from a pre-trained Visual Question Answering (VQA) model. At inference time, the model extracts concept embeddings along various axes from new test images, which can be remixed to generate images with novel compositions of visual concepts. With a lightweight test-time finetuning procedure, it can also generalize to novel concepts unseen during training.
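The training objective and the inference-time remixing described in the abstract can be illustrated with a toy sketch. It rests on several assumptions that the abstract does not state: the anchoring term is written as a mean squared error, the weighting `anchor_weight` is a hypothetical hyperparameter, and the reconstruction error from the frozen T2I model is passed in as a plain number rather than computed. The function names `training_loss` and `remix` and the axis names are illustrative only.

```python
from typing import Dict, List, Set

Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def training_loss(
    concept_embs: Dict[str, Vector],  # per-axis concept encoder outputs for one image
    anchor_embs: Dict[str, Vector],   # per-axis text embeddings (e.g. of VQA answers)
    recon_error: float,               # reconstruction error from the frozen T2I model
    anchor_weight: float = 0.1,       # hypothetical weighting, not from the abstract
) -> float:
    """Reconstruction term plus an anchoring term averaged over concept axes.

    The MSE form of the anchoring term is an assumption made for illustration.
    """
    anchor_term = sum(
        mse(concept_embs[axis], anchor_embs[axis]) for axis in concept_embs
    ) / len(concept_embs)
    return recon_error + anchor_weight * anchor_term


def remix(
    base: Dict[str, Vector],
    donor: Dict[str, Vector],
    axes_from_donor: Set[str],
) -> Dict[str, Vector]:
    """Take the listed axes from `donor` and every other axis from `base`."""
    return {
        axis: (donor[axis] if axis in axes_from_donor else emb)
        for axis, emb in base.items()
    }


# Toy 2-D embeddings for two images along two hypothetical axes.
img_a = {"category": [1.0, 0.0], "color": [0.0, 1.0]}
img_b = {"category": [0.5, 0.5], "color": [1.0, 1.0]}

# Keep the category of image A but borrow the color of image B.
mixed = remix(img_a, img_b, {"color"})
```

In a full system, the remixed embeddings would be passed as conditioning to the T2I model to generate the new composition; that step is omitted here because it depends on the specific model used.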