Dense retrieval has shown promise in first-stage retrieval when trained on in-domain labeled datasets. However, previous studies have found that dense retrieval generalizes poorly to unseen domains because it only weakly models domain-invariant and interpretable features (i.e., the matching signal between two texts, which is the essence of information retrieval). In this paper, we propose BERM, a novel method that improves the generalization of dense retrieval by capturing the matching signal. The matching signal has two properties: fully fine-grained expression and query-oriented saliency. Thus, in BERM, a single passage is segmented into multiple units, and two unit-level requirements on the representation are imposed as training constraints to obtain an effective matching signal: semantic unit balance and essential matching unit extractability. The unit-level view and balanced semantics make the representation express the text in a fine-grained manner. Essential matching unit extractability makes the passage representation sensitive to the given query, so that it extracts the pure matching information from a passage containing complex context. Experiments on BEIR show that our method can be effectively combined with different dense retrieval training methods (vanilla, hard negative mining, and knowledge distillation) to improve their generalization ability without any additional inference overhead or target-domain data.
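As a rough illustration of the unit-level view described above, the following sketch segments a passage into units and scores how evenly a passage vector attends to them. The abstract gives no formulas, so everything here is an assumption: sentence splitting stands in for unit segmentation, and the names `split_into_units` and `balance_penalty`, as well as the choice of a KL divergence against the uniform distribution, are hypothetical and not taken from BERM.

```python
import math
import re


def split_into_units(passage):
    """Split a passage into sentence-level units (an assumed stand-in for unit segmentation)."""
    parts = re.split(r"(?<=[.!?])\s+", passage.strip())
    return [p for p in parts if p]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def balance_penalty(passage_vec, unit_vecs):
    """Hypothetical balance measure: KL(p || uniform), where p is the softmax
    over dot products between the passage vector and each unit vector.
    It is 0 when the passage vector weights all units equally."""
    p = softmax([dot(passage_vec, u) for u in unit_vecs])
    n = len(p)
    return sum(pi * math.log(pi * n) for pi in p if pi > 0)


if __name__ == "__main__":
    units = split_into_units(
        "Dense retrieval is fast. It uses embeddings! Does it generalize?"
    )
    print(units)
    # Toy 2-d vectors; in practice these would come from an encoder.
    print(balance_penalty([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

In this toy setup, a penalty near zero means the passage vector reflects every unit about equally, which is one possible reading of "semantic unit balance"; a query-conditioned variant would be needed to illustrate essential matching unit extractability.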