Embedding the Teacher: Distilling vLLM Preferences for Scalable Image Retrieval

Text--image retrieval is essential for applications such as product recommendation. Embedding-based approaches like CLIP enable efficient large-scale retrieval via vector similarity search, but they are trained primarily on literal, caption-like text--image pairs and often fail to capture the abstract or persona-driven attributes common in such applications (e.g., ``a gift for a mother who loves gardening''). In contrast, state-of-the-art vision--language models (vLLMs) can align text with images flexibly, but their limited context window prevents them from directly performing retrieval over large catalogs. We propose a framework that distills the preference rankings of a powerful vLLM into an embedding-based system, transferring the vLLM's nuanced alignment abilities while retaining the inference-time scalability of embedding-based retrieval. Experiments on persona-driven product recommendation tasks demonstrate that our method significantly outperforms existing embedding-based baselines, providing an efficient solution for personalized text--image retrieval.
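
To make the idea of distilling preference rankings into an embedding model concrete, the following is a minimal sketch rather than the method itself. The abstract does not state the training objective, so the sketch assumes a Plackett--Luce listwise loss over cosine-similarity scores. The function names `cosine_scores` and `listwise_distill_loss`, the `temperature` parameter, and the format of `teacher_ranking` are all hypothetical.

```python
import numpy as np

def cosine_scores(query_emb, item_embs):
    """Cosine similarity between one query vector and each row of an item matrix."""
    q = query_emb / np.linalg.norm(query_emb)
    items = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    return items @ q

def listwise_distill_loss(student_scores, teacher_ranking, temperature=1.0):
    """Negative log-likelihood of the teacher's ranking under student scores.

    Assumption: a Plackett-Luce model, not necessarily the paper's objective.
    teacher_ranking lists item indices from most to least preferred
    (a hypothetical format for the vLLM teacher's output).
    """
    s = np.asarray(student_scores, dtype=float)[list(teacher_ranking)] / temperature
    loss = 0.0
    for i in range(len(s) - 1):
        tail = s[i:]  # items the teacher has not yet placed
        m = tail.max()  # subtract the max for a stable log-sum-exp
        loss += m + np.log(np.exp(tail - m).sum()) - s[i]
    return loss
```

Under this assumed loss, student scores that order the candidates the same way as the teacher yield a lower value than scores that reverse the order. In practice the loss would be minimized with respect to the embedding model's parameters; at inference time only `cosine_scores` (or an approximate nearest-neighbor index over the same embeddings) is needed, which is what preserves scalability.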
