Synthetic query generation has become essential for training dense retrievers, yet prior methods generate a single query per document and focus solely on query quality. We present the first systematic study of multi-query synthesis and identify a quality-diversity trade-off: high-quality queries benefit in-domain tasks, while diverse queries benefit out-of-domain (OOD) generalization. Through controlled experiments on four benchmark types with Contriever, RetroMAE, and Qwen3-Embedding, we find that the benefit of diversity correlates strongly with query complexity ($r \geq 0.95$, $p < 0.05$), where complexity is approximated by content words (CW). We formalize this as the Complexity-Diversity Principle (CDP): query complexity determines the optimal level of diversity. Based on CDP, we propose complexity-aware training, which uses multi-query synthesis for high-complexity tasks and CW-weighted training for existing data. Both strategies improve OOD performance on reasoning-intensive benchmarks, with compounded gains when combined.
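To make the content-word proxy concrete, the following is a minimal sketch of how a CW score and a CW-based example weight might be computed. The abstract does not specify how content words are identified or how the weights are defined, so the small function-word list, the helper names `content_word_count` and `cw_weights`, and the mean-normalized weighting are all illustrative assumptions rather than the method used in the paper.

```python
import re

# Hypothetical, deliberately small function-word list (an assumption);
# a real setup would more likely rely on a POS tagger or a full stopword list.
FUNCTION_WORDS = {
    "a", "an", "the", "of", "in", "on", "at", "to", "for", "by", "with",
    "and", "or", "but", "is", "are", "was", "were", "be", "do", "does",
    "did", "what", "which", "who", "how", "why", "when", "where", "that",
    "this", "it", "as", "from",
}


def content_word_count(query: str) -> int:
    """Count tokens that are not in the function-word list."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return sum(1 for t in tokens if t not in FUNCTION_WORDS)


def cw_weights(queries: list[str]) -> list[float]:
    """Assumed weighting scheme: CW count divided by the batch mean,
    so that the weights average to 1 over the batch."""
    counts = [content_word_count(q) for q in queries]
    mean = sum(counts) / len(counts)
    if mean == 0:
        # No content words anywhere: fall back to uniform weights.
        return [1.0] * len(queries)
    return [c / mean for c in counts]


queries = [
    "What is the capital of France?",
    "Prove that the sum of two even integers is even",
]
weights = cw_weights(queries)
```

Under this sketch, each weight would multiply the per-example contrastive loss, so queries with more content words contribute more to the gradient. Whether the weighting should be linear, clipped, or smoothed is a design choice the abstract leaves open.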