A good deal of recent research has focused on how Large Language Models (LLMs) may be used as judges, in place of humans, to evaluate the quality of the output produced by various text and image processing systems. Within this broader context, a number of studies have investigated the specific question of how effectively LLMs can be used as relevance assessors for the standard ad hoc task in Information Retrieval (IR). We extend these studies by looking at additional questions. Most importantly, we use a Wikipedia-based test collection created by the INEX initiative, and prompt LLMs not only to judge whether documents are relevant or non-relevant, but also to highlight the relevant passages in documents that they regard as useful. The human relevance assessors involved in creating this collection were given analogous instructions, i.e., they were asked to highlight all passages within a document that respond to the information need expressed in a query. This enables us to evaluate the quality of LLMs as judges not only at the document level, but also to quantify how often these judges are right for the right reasons. Our observations lead us to reiterate the cautionary note sounded in some earlier studies about using LLMs as assessors for creating IR datasets: while LLMs are unquestionably promising, and may be used judiciously to substantially reduce the amount of human involvement required to generate high-quality benchmark datasets, they cannot replace humans as assessors.
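The idea of checking whether a judge is "right for the right reasons" can be illustrated with a small sketch that compares the passages highlighted by an LLM against those highlighted by a human assessor. This is only an illustration under stated assumptions, not the evaluation procedure used in the study: it assumes highlights are available as half-open character-offset spans, and the function names, the character-level precision/recall measure, and the `min_recall` threshold are all hypothetical choices made for this example.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumption: a highlight is a half-open character range [start, end).
Span = Tuple[int, int]


def covered_chars(spans: Iterable[Span]) -> Set[int]:
    """Return the set of character offsets covered by the given spans."""
    chars: Set[int] = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def highlight_agreement(human: Iterable[Span], llm: Iterable[Span]) -> Dict[str, float]:
    """Character-level precision, recall and F1 of LLM highlights vs. human highlights."""
    human_chars = covered_chars(human)
    llm_chars = covered_chars(llm)
    overlap = len(human_chars & llm_chars)
    precision = overlap / len(llm_chars) if llm_chars else 0.0
    recall = overlap / len(human_chars) if human_chars else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def right_for_right_reasons(human: Iterable[Span], llm: Iterable[Span],
                            min_recall: float = 0.5) -> bool:
    """Hypothetical criterion: the LLM recovers at least `min_recall`
    of the characters that the human assessor highlighted."""
    return highlight_agreement(human, llm)["recall"] >= min_recall


if __name__ == "__main__":
    human_spans = [(0, 100)]    # human highlighted characters 0..99
    llm_spans = [(50, 150)]     # LLM highlighted characters 50..149
    print(highlight_agreement(human_spans, llm_spans))
    print(right_for_right_reasons(human_spans, llm_spans))
```

Under this sketch, a document that both judges mark as relevant but whose highlighted passages barely overlap would count as a document-level agreement that is not supported at the passage level.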