Vision-language models (VLMs) offer a promising paradigm for image classification by comparing the similarity between images and class embeddings. A critical challenge lies in crafting precise textual representations for class names. While previous studies have leveraged recent advances in large language models (LLMs) to enhance these descriptors, their outputs often suffer from ambiguity and inaccuracy. We identify two primary causes: 1) the prevalent reliance on text-only interactions with LLMs, which leads to a mismatch between the generated text and the visual content in the VLM's latent space, a phenomenon we term the "explain without seeing" dilemma; and 2) the neglect of inter-class relationships, which results in descriptors that fail to differentiate similar classes effectively. To address these issues, we propose a novel image classification framework that combines VLMs with LLMs, named Iterative Optimization with Visual Feedback. In particular, our method develops an LLM-based agent that employs an evolutionary optimization strategy to refine class descriptors. Crucially, we incorporate visual feedback from VLM classification metrics, thereby guiding the optimization process with concrete visual data. Our method improves accuracy on a wide range of image classification benchmarks, with an average gain of 3.47% over state-of-the-art methods. We also highlight that the resulting descriptions serve as explainable and robust features that consistently improve performance across various backbone models.
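The loop described in the abstract (propose new class descriptors, score them with a VLM classification metric, keep the better ones) can be illustrated with a small toy sketch. This is not the authors' implementation: the abstract does not specify the agent, the prompts, or the exact metrics. Here `VOCAB`, `embed`, `propose`, and `optimize` are hypothetical names, the "LLM" is replaced by a random word edit, the "VLM" is replaced by hand-written 4-dimensional vectors, and plain accuracy is the only feedback signal.

```python
import math
import random

# Hypothetical toy "text encoder": each attribute word maps to a fixed vector.
VOCAB = {
    "striped": (1.0, 0.0, 0.0, 0.0),
    "spotted": (0.0, 1.0, 0.0, 0.0),
    "orange": (0.0, 0.0, 1.0, 0.0),
    "yellow": (0.0, 0.0, 0.0, 1.0),
}

def embed(words):
    """Stand-in for a VLM text encoder: sum the vectors of the descriptor words."""
    vec = [0.0] * 4
    for w in words:
        for i, x in enumerate(VOCAB[w]):
            vec[i] += x
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def accuracy(descs, images):
    """Visual feedback: zero-shot accuracy of the descriptors on labeled images."""
    class_embs = {c: embed(words) for c, words in descs.items()}
    correct = 0
    for vec, label in images:
        pred = max(class_embs, key=lambda c: cosine(class_embs[c], vec))
        correct += pred == label
    return correct / len(images)

def propose(descs, rng):
    """Stand-in for the LLM agent: add or drop one word for a random class."""
    new = {c: list(words) for c, words in descs.items()}
    cls = rng.choice(sorted(new))
    if new[cls] and rng.random() < 0.5:
        new[cls].remove(rng.choice(new[cls]))
    else:
        new[cls].append(rng.choice(sorted(VOCAB)))
    return {c: tuple(words) for c, words in new.items()}

def optimize(init, images, rounds=20, pop=4, seed=0):
    """Greedy evolutionary loop: keep a candidate only if accuracy improves."""
    rng = random.Random(seed)
    best, best_acc = dict(init), accuracy(init, images)
    for _ in range(rounds):
        for cand in [propose(best, rng) for _ in range(pop)]:
            acc = accuracy(cand, images)
            if acc > best_acc:
                best, best_acc = cand, acc
    return best, best_acc

# Toy image embeddings (assumed values) for two easily confused classes.
images = [
    ((1.0, 0.0, 1.0, 0.0), "tiger"),
    ((0.9, 0.1, 1.0, 0.0), "tiger"),
    ((0.0, 1.0, 0.0, 1.0), "leopard"),
    ((0.1, 0.9, 0.2, 1.0), "leopard"),
]
# Both classes start with the same ambiguous descriptor.
init = {"tiger": ("orange",), "leopard": ("orange",)}

start_acc = accuracy(init, images)
best, best_acc = optimize(init, images)
```

Because a candidate is accepted only when it scores strictly higher, `best_acc` can never fall below `start_acc`. In a realistic setting, `embed` would call a VLM text encoder, `images` would hold image-encoder outputs, and `propose` would query an LLM with the current descriptors and the feedback metrics.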