Prior work on synthetic query generation for dense retrieval produces a single query per document and focuses on query quality. We systematically study multi-query synthesis and find a quality-diversity trade-off: quality helps in-domain retrieval, whereas diversity helps out-of-domain (OOD) retrieval. Experiments on 31 datasets show that diversity is especially beneficial for multi-hop retrieval. Our analysis shows that the benefit of diversity correlates with query complexity ($r \geq 0.95$), measured as the number of content words (CW). We formalize this as the Complexity-Diversity Principle (CDP): query complexity determines the optimal level of diversity. CDP yields practical thresholds (CW $> 10$: use diversity; CW $< 7$: avoid it) and enables CW-weighted training, which improves OOD performance even with single-query data.
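
The CDP thresholds can be illustrated with a minimal Python sketch. It rests on assumptions not taken from the source: content words are approximated as tokens outside a small, hypothetical stopword list (the abstract does not say how CW is computed), and the function names `count_content_words` and `diversity_recommendation` are illustrative only. The abstract gives no guidance for CW between 7 and 10, so the sketch reports that range as unspecified rather than guessing.

```python
import re

# Hypothetical stopword list used only for illustration; the source does not
# specify how content words are identified (a POS tagger is another option).
STOPWORDS = frozenset({
    "a", "an", "the", "of", "in", "on", "at", "to", "for", "by", "with",
    "and", "or", "but", "is", "are", "was", "were", "be", "been", "do",
    "does", "did", "what", "which", "who", "whom", "when", "where", "why",
    "how", "that", "this", "these", "those", "it", "its", "as", "from",
})


def count_content_words(query: str) -> int:
    """Approximate CW as the number of alphanumeric tokens not in STOPWORDS."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return sum(1 for token in tokens if token not in STOPWORDS)


def diversity_recommendation(cw: int) -> str:
    """Apply the thresholds stated in the abstract to a content-word count."""
    if cw > 10:
        return "use diversity"
    if cw < 7:
        return "avoid diversity"
    # The abstract states no rule for 7 <= CW <= 10.
    return "unspecified"


if __name__ == "__main__":
    query = "what is the capital of France"
    cw = count_content_words(query)
    print(cw, diversity_recommendation(cw))
```

Under this approximation, a short factoid query falls below the CW $< 7$ threshold, while a long multi-hop question with more than ten content words would fall in the range where the abstract recommends diverse query synthesis.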