Large multimodal models such as Stable Diffusion can generate, detect, and classify new visual concepts after fine-tuning just a single word embedding. Do models learn similar words for the same concepts (e.g., <orange-cat> = orange + cat)? We conduct a large-scale analysis on three state-of-the-art models in text-to-image generation, open-set object detection, and zero-shot classification, and find that new word embeddings are model-specific and non-transferable. Across 4,800 new embeddings trained for 40 diverse visual concepts on four standard datasets, we find perturbations within an $\epsilon$-ball of any prior embedding that generate, detect, and classify an arbitrary concept. When these new embeddings are spliced into new models, the fine-tuning that targeted the original model is lost. We show that popular soft prompt-tuning approaches find these perturbative solutions when applied to visual concept learning tasks, and that the resulting embeddings for visual concepts are not transferable. Code for reproducing our work is available at: https://visual-words.github.io.
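To make the $\epsilon$-ball constraint mentioned in the abstract concrete, the following is a minimal sketch of projecting a fine-tuned embedding back onto a ball around an existing embedding. It is an illustration under stated assumptions, not the authors' implementation: the L2 norm, the function name `project_to_eps_ball`, the embedding width of 768, and the radius `eps=0.1` are all hypothetical choices, and the vectors are random stand-ins rather than real model embeddings.

```python
import numpy as np


def project_to_eps_ball(candidate, anchor, eps):
    """Project `candidate` onto the L2 ball of radius `eps` centred on `anchor`.

    Hypothetical helper: the choice of the L2 norm is an assumption made
    for illustration only.
    """
    delta = candidate - anchor
    norm = np.linalg.norm(delta)
    if norm <= eps:
        # Already inside the ball: return an unchanged copy.
        return candidate.copy()
    # Rescale the offset so it lies exactly on the ball's surface.
    return anchor + delta * (eps / norm)


rng = np.random.default_rng(0)
dim = 768                                   # assumed embedding width
eps = 0.1                                   # assumed perturbation radius
anchor = rng.normal(size=dim)               # stand-in for a prior word embedding
candidate = anchor + rng.normal(size=dim)   # stand-in for a fine-tuned embedding

projected = project_to_eps_ball(candidate, anchor, eps)
print(np.linalg.norm(projected - anchor) <= eps + 1e-9)
```

In an actual fine-tuning loop, a projection of this kind would typically be applied after each gradient step so that the learned embedding never leaves the neighbourhood of the chosen prior embedding.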