Text--image retrieval underpins applications such as product recommendation. Embedding-based approaches like CLIP enable efficient large-scale retrieval via vector similarity search, but they are trained primarily on literal, caption-like text--image pairs and often fail to capture the abstract or persona-driven attributes common in recommendation settings (e.g., ``a gift for a mother who loves gardening''). In contrast, state-of-the-art vision large language models (vLLMs) can align text with images far more flexibly, but their limited context window prevents them from directly performing retrieval over large catalogs. We propose a framework that distills the preference rankings of a powerful vLLM into an embedding-based retriever, transferring the vLLM's nuanced alignment abilities while preserving the inference-time scalability of the embedding approach. Experiments on persona-driven product recommendation tasks demonstrate that our method significantly outperforms existing embedding-based baselines, providing an efficient solution for personalized text--image retrieval.
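To make the distillation idea concrete, the sketch below is a minimal, hypothetical illustration (not the paper's exact objective): a ListNet-style KL loss that pulls the embedding model's query--candidate similarities toward preference scores elicited from the vLLM teacher. All function and tensor names here are assumptions introduced for illustration.
\begin{verbatim}
# Hypothetical sketch: distill VLM preference rankings into a dual-encoder
# student. teacher_scores would come from prompting the vLLM to score or
# rank candidate product images for a persona-style query.
import torch
import torch.nn.functional as F

def listwise_distillation_loss(student_sims, teacher_scores, temperature=1.0):
    """KL divergence between the teacher's ranking distribution and the
    student's similarity distribution over the same candidate set.

    student_sims:   (batch, num_candidates) cosine similarities from the
                    embedding model (e.g., CLIP text/image encoders).
    teacher_scores: (batch, num_candidates) preference scores from the vLLM.
    """
    teacher_probs = F.softmax(teacher_scores / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_sims / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

if __name__ == "__main__":
    # One query with 4 candidate images (toy numbers).
    student_sims = torch.tensor([[0.31, 0.12, 0.55, 0.08]], requires_grad=True)
    teacher_scores = torch.tensor([[2.0, 0.5, 3.0, 0.1]])
    loss = listwise_distillation_loss(student_sims, teacher_scores)
    loss.backward()  # gradients would flow into the student encoders
    print(f"distillation loss: {loss.item():.4f}")
\end{verbatim}
At inference time only the student encoders are used, so retrieval reduces to a standard nearest-neighbor search over precomputed image embeddings, preserving the scalability noted above.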