Vision encoders for retrieval are typically trained with class-label supervision: each training pair reduces to a scalar that uniformly pushes the two embeddings apart or pulls them together, as if all visual attributes differed or all matched. A multimodal large language model (MLLM), shown the same pair, can articulate those attributes and use them to predict whether the images share a class. We propose \textbf{SAGA}, a framework that turns this language-grounded, attribute-aware perception into a training signal for the encoder itself. Specifically, we use Group Relative Policy Optimization (GRPO) to reward the MLLM for correct predictions made from the vision encoder's tokens. Since correct predictions require those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision. An auxiliary attention-distillation loss anchors the encoder's embedding to the tokens the MLLM attends to, and a standard metric-learning loss shapes the embedding geometry for nearest-neighbour retrieval. The MLLM is frozen throughout and discarded at inference, so deployment cost matches that of a metric-learning baseline. On zero-shot image retrieval, SAGA improves Recall@1 by 3 to 6 points over state-of-the-art baselines on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves.
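
For concreteness, the group-relative signal can be sketched as follows; this is the standard GRPO normalisation, not a statement of SAGA's exact objective. Suppose the MLLM samples $G$ responses for one image pair and response $i$ receives reward $r_i$ (for instance, $1$ for a correct same-class or different-class prediction and $0$ otherwise, which is an assumed reward design). Its advantage is
\[
\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G) + \epsilon},
\]
where $\epsilon > 0$ is a small constant for numerical stability. One plausible way to combine the three training terms, with hypothetical weights $\lambda_{\mathrm{attn}}$ and $\lambda_{\mathrm{ml}}$ that are introduced here only for illustration, is
\[
\mathcal{L} = \mathcal{L}_{\mathrm{GRPO}} + \lambda_{\mathrm{attn}}\, \mathcal{L}_{\mathrm{attn}} + \lambda_{\mathrm{ml}}\, \mathcal{L}_{\mathrm{ml}},
\]
with $\mathcal{L}_{\mathrm{attn}}$ the attention-distillation loss and $\mathcal{L}_{\mathrm{ml}}$ the metric-learning loss. Because the MLLM is frozen, gradients of $\mathcal{L}$ update only the vision encoder.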