Test collections play a vital role in the evaluation of information retrieval (IR) systems. Obtaining a diverse set of user queries for test collection construction can be challenging, and acquiring relevance judgments, which indicate how relevant retrieved documents are to a query, is often costly and resource-intensive. Generating synthetic datasets using Large Language Models (LLMs) has recently gained significant attention in various applications. In IR, while previous work has exploited the capabilities of LLMs to generate synthetic queries or documents to augment training data and improve the performance of ranking models, the use of LLMs for constructing synthetic test collections remains relatively unexplored. Previous studies have demonstrated that LLMs have the potential to generate synthetic relevance judgments for use in the evaluation of IR systems. In this paper, we comprehensively investigate whether it is possible to use LLMs to construct fully synthetic test collections by generating not only synthetic judgments but also synthetic queries. In particular, we analyse whether reliable synthetic test collections can be constructed, and what risks of bias such test collections may exhibit towards LLM-based models. Our experiments indicate that it is possible to use LLMs to construct synthetic test collections that can reliably be used for retrieval evaluation.