Prior work on synthetic query generation for dense retrieval produces one query per document and focuses on query quality. We systematically study multi-query synthesis and find a quality-diversity trade-off: quality benefits in-domain retrieval, whereas diversity benefits out-of-domain (OOD) retrieval. Experiments on 31 datasets show that diversity especially benefits multi-hop retrieval. Our analysis reveals that the benefit of diversity correlates strongly with query complexity (r ≥ 0.95), measured by the number of content words (CW). We formalize this as the Complexity-Diversity Principle (CDP): query complexity determines the optimal level of diversity. CDP provides explicit thresholds (CW > 10: use diversity; CW < 7: avoid it) and enables CW-weighted training that improves OOD performance even with single-query data.
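The CDP thresholds can be illustrated with a minimal sketch. The abstract does not say how content words are identified, so the stopword-based count below is an assumption, and the names `STOPWORDS`, `count_content_words`, and `diversity_recommendation` are hypothetical. The abstract also gives no guidance for 7 ≤ CW ≤ 10, so that range is returned as unspecified rather than guessed.

```python
import re

# Hypothetical, deliberately small stopword list. The source does not define
# how content words are identified; this is an illustrative assumption.
STOPWORDS = frozenset(
    """a an the of in on at to for from by with and or but is are was were be
    been do does did what which who whom whose when where why how that this
    these those it its as""".split()
)


def count_content_words(query: str) -> int:
    """Count tokens that are not in the (assumed) stopword list."""
    tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", query.lower())
    return sum(1 for token in tokens if token not in STOPWORDS)


def diversity_recommendation(cw: int) -> str:
    """Apply the thresholds stated in the abstract to a content-word count."""
    if cw > 10:
        return "use diversity"
    if cw < 7:
        return "avoid diversity"
    # The abstract states no rule for 7 <= CW <= 10.
    return "unspecified"


if __name__ == "__main__":
    query = "What is the capital of France?"
    cw = count_content_words(query)
    print(cw, diversity_recommendation(cw))
```

With this stopword list, a short factoid query such as the one above falls below the CW < 7 threshold. A different content-word definition (for example, one based on part-of-speech tags) would shift individual counts, so the thresholds should be read together with whatever CW definition the full paper uses.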