MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query

Wei Chow, Yuan Gao, Linfeng Li, Xian Wang, Qi Xu, Hang Song, Lingdong Kong, Ran Zhou, Yi Zeng, Yidong Cai, Botian Jiang, Shilin Xu, Jiajun Zhang, Minghui Qiu, Xiangtai Li, Tianshu Yang, Siliang Tang, Juncheng Li

From arXiv; NeurIPS 2025. Project page, code, and dataset: https://merit-2025.github.io/

Semantic retrieval is crucial for modern applications yet remains underexplored in current research. Existing datasets are limited to single languages, single images, or singular retrieval conditions, and often fail to fully exploit the expressive capacity of visual information, as evidenced by maintained performance when images are replaced with captions. However, practical retrieval scenarios frequently involve interleaved multi-condition queries with multiple images. Hence, this paper introduces MERIT, the first multilingual dataset for interleaved multi-condition semantic retrieval, comprising 320,000 queries with 135,000 products in 5 languages, covering 7 distinct product categories. Extensive experiments on MERIT identify a limitation of existing models: they focus solely on global semantic information while neglecting specific conditional elements in queries. Consequently, we propose Coral, a novel fine-tuning framework that adapts pre-trained MLLMs by integrating embedding reconstruction to preserve fine-grained conditional elements and contrastive learning to extract comprehensive global semantics. Experiments demonstrate that Coral achieves a 45.9% performance improvement over conventional approaches on MERIT, with strong generalization capabilities validated across 8 established retrieval benchmarks. Collectively, our contributions (a novel dataset, identification of critical limitations in existing approaches, and an innovative fine-tuning framework) establish a foundation for future research in interleaved multi-condition semantic retrieval.
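The abstract describes Coral only at a high level: a contrastive objective for global semantics combined with an embedding reconstruction objective for fine-grained conditions. As a rough illustration of how two such terms might be combined, the sketch below pairs an InfoNCE-style contrastive loss with a mean-squared reconstruction error. This is not the authors' implementation. The function names, the choice of InfoNCE and MSE, the `temperature` value, and the `recon_weight` factor are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def info_nce(queries, targets, temperature=0.07):
    """Mean InfoNCE loss; targets[i] is treated as the positive for queries[i].

    All other targets in the batch serve as negatives (an assumption).
    """
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, tj) / temperature for tj in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(q)


def reconstruction_mse(reconstructed, original):
    """Mean squared error between reconstructed and original embeddings."""
    total, count = 0.0, 0
    for r, o in zip(reconstructed, original):
        for a, b in zip(r, o):
            total += (a - b) ** 2
            count += 1
    return total / count


def combined_loss(queries, targets, reconstructed, original,
                  recon_weight=1.0, temperature=0.07):
    """Hypothetical combined objective: contrastive term plus weighted reconstruction term."""
    contrastive = info_nce(queries, targets, temperature)
    recon = reconstruction_mse(reconstructed, original)
    return contrastive + recon_weight * recon


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    targets = [[1.0, 0.0], [0.0, 1.0]]
    print(combined_loss(queries, targets, queries, queries))
```

In this toy setup the contrastive term is near zero when each query embedding points at its own target and grows when the pairing is wrong, while the reconstruction term penalizes embeddings that drift from what they are meant to preserve.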

