Large pre-trained vision-language models have shown great promise in transferring pre-acquired knowledge to various domains and downstream tasks with appropriate prompting or tuning. Existing tuning methods fall broadly into three categories: 1) prompt engineering, which crafts suitable prompt texts but is time-consuming and requires domain expertise; 2) fine-tuning the whole model, which is extremely inefficient; and 3) prompt tuning, which learns parameterized prompt embeddings through the text encoder. Nevertheless, all of these methods rely on the text encoder to bridge the modality gap between vision and language. In this work, we question whether the cumbersome text encoder is necessary, aiming for a more lightweight and efficient tuning paradigm as well as more representative prompt embeddings that lie closer to the image representations. To this end, we propose Concept Embedding Search (ConES), which optimizes prompt embeddings, without the text encoder, to capture the 'concept' of the image modality through a variety of task objectives. Dropping the text encoder significantly speeds up learning, e.g., from about an hour to just ten minutes in our experiments on personalized text-to-image generation, without impairing generation quality. Moreover, our approach is orthogonal to existing tuning methods, since the searched concept embeddings can be reused in a subsequent stage of fine-tuning the large pre-trained model to further boost performance. Extensive experiments show that our approach outperforms prompt tuning and textual inversion methods on a variety of downstream tasks, including object detection, instance segmentation, and image generation. Our approach also generalizes better to unseen concepts in specialized domains, such as the medical domain.
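The core idea, treating the prompt embedding itself as the only trainable quantity and updating it directly from a task loss on a frozen model, can be illustrated with a toy sketch. This is not the ConES implementation: the frozen model is stood in for by a fixed linear map `FROZEN_W`, the task objective is a plain squared error against a hypothetical image-side target `TARGET`, and all names and hyperparameters below are assumptions chosen for illustration.

```python
import random


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def loss_and_grad(W, e, target):
    """Squared error between the frozen model's output W @ e and the target,
    together with its gradient with respect to the embedding e."""
    residual = [p - t for p, t in zip(matvec(W, e), target)]
    loss = sum(r * r for r in residual)
    # d/de ||W e - t||^2 = 2 * W^T (W e - t)
    grad = [
        2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
        for j in range(len(e))
    ]
    return loss, grad


def search_concept_embedding(W, target, dim, steps=500, lr=0.05, seed=0):
    """Gradient descent on the embedding only; W stays frozen and no text
    encoder is involved (toy stand-in, not the paper's training loop)."""
    rng = random.Random(seed)
    e = [rng.gauss(0.0, 0.1) for _ in range(dim)]  # free embedding vector
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(W, e, target)
        history.append(loss)
        e = [e_j - lr * g_j for e_j, g_j in zip(e, grad)]
    return e, history


# Hypothetical frozen model (2 outputs, 3-dim embedding) and target feature.
FROZEN_W = [[1.0, 0.5, 0.0], [0.0, 1.0, 0.5]]
TARGET = [1.0, -1.0]

embedding, losses = search_concept_embedding(FROZEN_W, TARGET, dim=3)
print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.2e}")
```

In a real setting the linear map would be replaced by the frozen detection, segmentation, or diffusion model and the squared error by that model's task loss; only the embedding receives gradient updates.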