Prior work reports conflicting results on the effect of query diversity in synthetic data generation for dense retrieval. We identify this conflict and design Q-D metrics to quantify diversity's impact, making the problem measurable. In experiments across 4 benchmark types (31 datasets), we find that query diversity especially benefits multi-hop retrieval. In-depth analysis of multi-hop data reveals that the benefit of diversity correlates strongly with query complexity, measured by the number of content words (CW): $r \geq 0.95$, $p < 0.05$ in 12/14 conditions. We formalize this as the Complexity-Diversity Principle (CDP): query complexity determines the optimal level of diversity. CDP provides actionable thresholds (CW $> 10$: use diversity; CW $< 7$: avoid it). Guided by CDP, we propose zero-shot multi-query synthesis for multi-hop tasks, achieving state-of-the-art performance.
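As a rough illustration of how the CDP thresholds might be applied in practice, the sketch below maps a query's content-word count to a recommendation. It is a minimal sketch under stated assumptions, not the authors' implementation: the stopword list, the tokenizer, and the function names `count_content_words` and `cdp_recommendation` are hypothetical. The abstract does not specify how content words are identified, nor what to do for CW between 7 and 10, so that range is reported as unspecified.

```python
import re

# Hypothetical stopword list for illustration only; the source does not
# say how content words (CW) are identified.
STOPWORDS = {
    "a", "an", "the", "of", "in", "on", "at", "to", "for", "by", "with",
    "and", "or", "is", "are", "was", "were", "be", "been", "do", "does",
    "did", "who", "what", "which", "when", "where", "that", "this", "it",
}


def count_content_words(query: str) -> int:
    """Count tokens that are not in the (assumed) stopword list."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return sum(1 for token in tokens if token not in STOPWORDS)


def cdp_recommendation(cw: int) -> str:
    """Apply the thresholds stated in the abstract.

    CW > 10 -> use diversity; CW < 7 -> avoid it.
    The abstract gives no guidance for 7 <= CW <= 10.
    """
    if cw > 10:
        return "use_diversity"
    if cw < 7:
        return "avoid_diversity"
    return "unspecified"


if __name__ == "__main__":
    query = "Who is the director of the film?"
    cw = count_content_words(query)
    print(cw, cdp_recommendation(cw))
```

A real pipeline would likely replace the stopword filter with part-of-speech tagging or whatever definition of CW the full paper adopts; only the two threshold comparisons come from the text above.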