Spoken query retrieval is an important interaction mode in modern information retrieval. However, existing evaluation datasets are often limited to simple queries under constrained noise conditions, making them inadequate for assessing the robustness of spoken query retrieval systems under complex acoustic perturbations. To address this limitation, we present SQuTR, a robustness benchmark for spoken query retrieval that comprises a large-scale dataset and a unified evaluation protocol. SQuTR aggregates 37,317 unique queries from six commonly used English and Chinese text retrieval datasets, spanning multiple domains and diverse query types. We synthesize speech using voice profiles from 200 real speakers and mix in 17 categories of real-world environmental noise at controlled SNR levels, enabling reproducible robustness evaluation from quiet to highly noisy conditions. Under the unified protocol, we conduct large-scale evaluations of representative cascaded and end-to-end retrieval systems. Experimental results show that retrieval performance degrades as noise increases, with the magnitude of degradation varying substantially across systems. Even large-scale retrieval models struggle under extreme noise, indicating that robustness remains a critical bottleneck. Overall, SQuTR provides a reproducible testbed for benchmarking and diagnostic analysis, and facilitates future research on robustness in spoken-query-to-text retrieval.
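The SNR-controlled noise mixing mentioned above can be sketched as follows. This is a minimal illustration of the general technique (scaling a noise signal so that the speech-to-noise power ratio hits a target SNR in dB), not SQuTR's actual pipeline; the function name and interface are hypothetical.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix environmental noise into clean speech at a target SNR (in dB).

    Hypothetical helper for illustration only.
    """
    # Tile or truncate the noise clip to match the speech length.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    # Average power (mean squared amplitude) of each signal.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)

    # Scale the noise so that 10 * log10(p_speech / p_scaled_noise) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Sweeping `snr_db` from high (near-quiet) to low or negative values (noise-dominated) yields the kind of controlled quiet-to-noisy test conditions the benchmark describes.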