Scalable information retrieval testing needs corpora that are large enough to stress index construction, ranking latency, query routing, and evaluation tooling, yet human-judged test collections remain expensive and may be unavailable when documents are private or still under design. This paper introduces SPECTRA, a reproducible framework for generating synthetic text corpora and retrieval test collections by separating five components: latent topical structure, surface text realization, metadata controls, query intent generation, and deterministic relevance oracles. The framework is intended as a diagnostic complement to Cranfield-style and TREC-style evaluation, not as a replacement for human assessment. A single-process Python prototype generated corpora of up to 60,000 documents and 9.61 million tokens while preserving controllable long-tail vocabulary growth and producing graded relevance labels for 96 queries. In the local simulation study, generation time scaled close to linearly with corpus size at roughly 12,000 to 14,000 documents per second, estimated Zipf slopes stayed near 0.86 in absolute value, and increasing the share of cross-topic distractor text reduced BM25 nDCG@10 from 1.00 at 2% distractors to 0.43 at 36% distractors. These results suggest that lightweight synthetic corpora can expose retrieval-system scaling limits and failure modes before costly collection construction begins.
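To make the reported nDCG@10 figures concrete, the following is a minimal sketch of how nDCG@k could be computed against graded oracle labels. It is an illustration under stated assumptions, not SPECTRA's implementation: the function names, the document identifiers, the grade values, and the choice of linear gain with a log2 discount are all hypothetical, since the abstract does not specify the gain function or label scale.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k gains (log2 discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranking: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query; documents missing from qrels count as grade 0.

    Assumption: linear gain (the grade itself), not the 2**grade - 1 variant.
    """
    gains = [float(qrels.get(doc_id, 0)) for doc_id in ranking]
    ideal = sorted((float(g) for g in qrels.values()), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0.0:
        return 0.0  # no relevant documents: define the score as 0
    return dcg_at_k(gains, k) / ideal_dcg


# Hypothetical oracle labels for one query: 2 = on-topic, 1 = partially relevant.
qrels = {"d1": 2, "d2": 1, "d3": 1}

# "x*" identifiers stand in for cross-topic distractor documents (grade 0).
perfect = ["d1", "d2", "d3", "x1", "x2"]
degraded = ["x1", "x2", "d1", "d2", "d3"]

print(ndcg_at_k(perfect, qrels))
print(ndcg_at_k(degraded, qrels))
```

In this toy case the second ranking scores lower because two distractors occupy the top ranks, which is the same qualitative effect the distractor experiment measures at corpus scale.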