With the advent of large pre-trained vision-language models such as CLIP, prompt learning methods aim to enhance the transferability of the CLIP model. They learn the prompt from a few samples of the downstream task, using the specific class names as prior knowledge, a setting we term semantic-aware classification. However, in many realistic scenarios, we only have access to a few samples, without knowledge of the class names (e.g., when we only observe instances of the classes). This challenging scenario represents the semantic-agnostic discriminative case. Text-to-Image (T2I) personalization methods aim to adapt T2I models to unseen concepts by learning new tokens and endowing these tokens with the capability of generating the learned concepts. These methods do not require class names as a semantic-aware prior. In this paper, we therefore first explore Textual Inversion and reveal that, by regarding each category as a single concept, the new concept tokens possess both generation and classification capabilities. However, learning classifiers from single-concept textual inversion is limited, since the learned tokens are suboptimal for discriminative tasks. To mitigate this issue, we propose Multi-Class Textual Inversion, which adds a discriminative regularization term to the token updating process. With this technique, our method MC-TI achieves stronger semantic-agnostic classification while preserving the generation capability of the modifier tokens, given only a few samples per category. In the experiments, we extensively evaluate MC-TI on 12 datasets covering various scenarios, demonstrating that MC-TI achieves superior results in terms of both classification and generation.
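The core of the discriminative regularization can be illustrated with a minimal sketch: each class gets one learnable concept-token embedding, and a cross-entropy term pulls each token toward the (frozen) image features of its own class and away from the others. The sketch below is an assumption-laden toy, not the paper's implementation: random unit vectors stand in for CLIP image features, the token matrix `T` and feature matrix `X` are hypothetical names, and the generation (diffusion) loss is omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim, shots = 5, 64, 4

# Stand-ins for frozen image features of the few-shot support set
# (in the actual method these would come from a frozen image encoder).
X = rng.normal(size=(num_classes * shots, dim))
X /= np.linalg.norm(X, axis=1, keepdims=True)
labels = np.repeat(np.arange(num_classes), shots)
onehot = np.eye(num_classes)[labels]

# One learnable concept-token embedding per class.
T = rng.normal(size=(num_classes, dim)) * 0.01

lr = 1.0
for step in range(200):
    logits = X @ T.T                                   # similarity of each image to each token
    shifted = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(shifted)
    p /= p.sum(axis=1, keepdims=True)                  # softmax over classes
    grad_T = (p - onehot).T @ X / len(X)               # gradient of the cross-entropy regularizer
    T -= lr * grad_T

# After updating, the tokens themselves act as a classifier.
acc = ((X @ T.T).argmax(axis=1) == labels).mean()
print(f"support-set accuracy: {acc:.2f}")
```

In the full method this cross-entropy term would be added to the usual textual-inversion reconstruction loss, so the tokens stay usable for generation while becoming more separable for classification.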