Making the relevance judgments for a TREC-style test collection can be complex and expensive. A typical TREC track involves a team of six contractors working for 2-4 weeks. Those contractors need to be trained and monitored, and software has to be written to support recording relevance judgments correctly and efficiently. The recent advent of large language models that produce astoundingly human-like flowing text in response to a natural language prompt has inspired IR researchers to wonder how those models might be used in the relevance judgment collection process. At the ACM SIGIR 2024 conference, the ``LLM4Eval'' workshop provided a venue for this work, and featured a data challenge activity in which participants reproduced TREC deep learning track judgments, as was done by Thomas et al. (arXiv:2408.08896, arXiv:2309.10621). I was asked to give a keynote at the workshop, and this paper presents that keynote in article form. The bottom-line-up-front message is: don't use LLMs to create relevance judgments for TREC-style evaluations.