Test collections play a vital role in the evaluation of information retrieval (IR) systems. Obtaining a diverse set of user queries for test collection construction can be challenging, and acquiring relevance judgments, which indicate how relevant retrieved documents are to a query, is often costly and resource-intensive. Generating synthetic datasets using Large Language Models (LLMs) has recently gained significant attention in various applications. In IR, while previous work has exploited the capabilities of LLMs to generate synthetic queries or documents to augment training data and improve the performance of ranking models, the use of LLMs for constructing synthetic test collections remains relatively unexplored. Previous studies have demonstrated that LLMs have the potential to generate synthetic relevance judgments for use in the evaluation of IR systems. In this paper, we comprehensively investigate whether it is possible to use LLMs to construct fully synthetic test collections by generating not only synthetic judgments but also synthetic queries. In particular, we analyse whether reliable synthetic test collections can be constructed, and what risks of bias such test collections may exhibit towards LLM-based models. Our experiments indicate that it is possible to use LLMs to construct synthetic test collections that can reliably be used for retrieval evaluation.