A good deal of recent research has focused on how Large Language Models (LLMs) may be used as `judges' in place of humans to evaluate the quality of the output produced by various text/image processing systems. Within this broader context, a number of studies have investigated the specific question of how effectively LLMs can be used as relevance assessors for the standard ad hoc task in Information Retrieval (IR). We extend these studies by looking at additional questions. Most importantly, we use a Wikipedia-based test collection created by the INEX initiative, and prompt LLMs not only to judge whether documents are relevant or non-relevant, but also to highlight relevant passages in documents they regard as useful. The human relevance assessors involved in creating this collection were given analogous instructions, i.e., they were asked to highlight all passages within a document that respond to the information need expressed in a query. This enables us to evaluate the quality of LLMs as judges not only at the document level, but also to quantify how often these `judges' are right for the right reasons. Our findings suggest that LLMs-as-judges work best under human supervision.