Large-scale vision-language models such as CLIP achieve strong zero-shot recognition but struggle with classes that are rarely seen during pretraining, including newly emerging entities and culturally specific categories. We introduce LiteEmbed, a lightweight framework for few-shot personalization of CLIP that enables new classes to be added without retraining its encoders. LiteEmbed performs subspace-guided optimization of text embeddings within CLIP's vocabulary, leveraging a PCA-based decomposition that disentangles coarse semantic directions from fine-grained variations. Two complementary objectives, coarse alignment and fine separation, jointly preserve global semantic consistency while enhancing discriminability among visually similar classes. Once optimized, the embeddings are plug-and-play, seamlessly substituting CLIP's original text features across classification, retrieval, segmentation, and detection tasks. Extensive experiments demonstrate substantial gains over prior methods, establishing LiteEmbed as an effective approach for adapting CLIP to underrepresented, rare, or unseen classes.
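The subspace decomposition described above can be illustrated with a small sketch. This is a hypothetical reconstruction, not the paper's actual implementation: it applies PCA to a matrix of class text embeddings, treats the top principal components as the coarse semantic subspace and the remainder as the fine subspace, and defines two toy objectives (`coarse_alignment`, `fine_separation` — names assumed) matching the coarse-alignment and fine-separation goals stated in the abstract.

```python
import numpy as np

def pca_subspaces(E, k):
    # Split the embedding space via PCA: top-k principal components form the
    # coarse semantic subspace, the remaining components the fine subspace.
    # E: (n_classes, d) matrix of L2-normalized CLIP text embeddings (assumed input).
    mu = E.mean(axis=0)
    _, _, Vt = np.linalg.svd(E - mu, full_matrices=False)
    return Vt[:k], Vt[k:], mu  # coarse basis, fine basis, mean

def coarse_alignment(e_new, e_ref, V_coarse, mu):
    # 1 - cosine similarity of coarse projections: minimizing this keeps the
    # optimized embedding globally consistent with the reference semantics.
    a = (e_new - mu) @ V_coarse.T
    b = (e_ref - mu) @ V_coarse.T
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def fine_separation(e_new, E_similar, V_fine, mu):
    # Mean cosine similarity to visually similar classes within the fine
    # subspace: minimizing this pushes confusable classes apart.
    a = (e_new - mu) @ V_fine.T
    B = (E_similar - mu) @ V_fine.T
    cos = B @ a / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-8)
    return cos.mean()

# Synthetic stand-in for CLIP text embeddings (16 classes, 64-dim).
rng = np.random.default_rng(0)
E = rng.normal(size=(16, 64))
E /= np.linalg.norm(E, axis=1, keepdims=True)

V_coarse, V_fine, mu = pca_subspaces(E, k=4)
total = coarse_alignment(E[0], E[0], V_coarse, mu) \
        + fine_separation(E[0], E[1:4], V_fine, mu)
```

In the full method, the optimized embedding would be a free variable updated by gradient descent on a weighted sum of the two objectives; the sketch only evaluates them once to show the geometry of the decomposition.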