When asked, large language models (LLMs) like ChatGPT claim that they can assist with relevance judgments, but it is not clear whether automated judgments can reliably be used in evaluations of retrieval systems. In this perspectives paper, we discuss possible ways for LLMs to support relevance judgments, along with the concerns and issues that arise. We devise a human-machine collaboration spectrum that categorizes different relevance judgment strategies based on how much humans rely on machines. For the extreme point of "fully automated judgments", we further include a pilot experiment on whether LLM-based relevance judgments correlate with judgments from trained human assessors. We conclude the paper by providing opposing perspectives for and against the use of LLMs for automatic relevance judgments, and a compromise perspective, informed by our analyses of the literature, our preliminary experimental evidence, and our experience as IR researchers.