Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

Vision encoders for retrieval are typically trained with class-label supervision: each training pair reduces to a scalar that uniformly pushes the two embeddings apart or pulls them together, as if every visual attribute either differed or matched. A multimodal large language model (MLLM), shown the same pair, can articulate those attributes and use them to predict whether the images share a class. We propose SAGA, a framework that turns this language-grounded, attribute-aware perception into a training signal for the encoder itself. Specifically, we use Group Relative Policy Optimization (GRPO) to reward the MLLM for correct predictions on the vision encoder's tokens. Since correct predictions require those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision. An auxiliary attention-distillation loss anchors the encoder's embedding to the tokens the MLLM attended to, and a standard metric-learning loss shapes the embedding geometry for nearest-neighbour retrieval. The MLLM is frozen throughout and discarded at inference, matching the deployment cost of a metric-learning baseline. On zero-shot image retrieval, SAGA improves Recall@1 by 3 to 6 points over state-of-the-art baselines on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves.
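The abstract does not spell out how the GRPO reward is computed, so the following is only a minimal sketch of the generic group-relative step that GRPO is known for: several answers are sampled for the same image pair, each is scored, and the scores are normalized within the group. The binary same-class reward, the function names, the group size of four, and the use of the population standard deviation are all assumptions made for illustration, not details taken from SAGA.

```python
from statistics import mean, pstdev


def same_class_reward(predicted_same: bool, actually_same: bool) -> float:
    """Hypothetical binary reward: 1.0 if the same/different prediction is correct."""
    return 1.0 if predicted_same == actually_same else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    Population standard deviation is an assumption; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers for one pair that truly shares a class.
predictions = [True, False, True, True]
rewards = [same_class_reward(p, actually_same=True) for p in predictions]
advantages = group_relative_advantages(rewards)

# Correct answers receive a positive advantage, the incorrect one a negative one.
for pred, adv in zip(predictions, advantages):
    print(f"predicted_same={pred!s:5}  advantage={adv:+.3f}")
```

When every answer in a group earns the same reward, all advantages are zero, so in this sketch such a group contributes no policy-gradient signal.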

