Vision-language models (VLMs) offer a promising paradigm for image classification by measuring the similarity between images and class embeddings. A critical challenge lies in crafting precise textual representations of class names. While previous studies have leveraged recent advances in large language models (LLMs) to enhance these descriptors, the resulting descriptors are often ambiguous or inaccurate. We attribute this to two primary factors: 1) reliance on single-turn, text-only interactions with LLMs, which leads to a mismatch between the generated text and the visual concepts the VLM encodes; 2) neglect of inter-class relationships, which yields descriptors that fail to differentiate similar classes effectively. In this paper, we propose a novel framework that integrates LLMs and VLMs to search for optimal class descriptors. Our training-free approach builds an LLM-based agent that uses an evolutionary optimization strategy to iteratively refine class descriptors. We demonstrate that the optimized descriptors are of high quality and effectively improve classification accuracy on a wide range of benchmarks. Additionally, these descriptors offer explainable and robust features, boosting performance across various backbone models and complementing fine-tuning-based methods.
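As a rough illustration of the descriptor-based classification described above, the following minimal sketch scores each class by the similarity between an image embedding and the embeddings of that class's descriptors. It is not the paper's implementation: the function names `cosine` and `classify`, the hand-made 3-dimensional vectors (stand-ins for real VLM embeddings), and the choice to average similarity over a class's descriptors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(image_emb, class_descriptor_embs):
    """Return (best_class, scores).

    Each class score is the mean cosine similarity between the image
    embedding and that class's descriptor embeddings. Mean aggregation
    is an assumption here, not something stated in the abstract.
    """
    scores = {
        name: sum(cosine(image_emb, d) for d in descs) / len(descs)
        for name, descs in class_descriptor_embs.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy vectors standing in for VLM embeddings (hypothetical values).
image = [0.9, 0.1, 0.0]
descriptors = {
    "sparrow": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "finch": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.2]],
}

label, scores = classify(image, descriptors)
print(label, scores)
```

Under these assumptions, an iterative refinement procedure could use the accuracy of such a scorer on labeled images as the signal for deciding which candidate descriptors to keep, and descriptors that separate visually similar classes would raise that signal most.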