This resource paper addresses the challenge of evaluating Information Retrieval (IR) systems in the era of autoregressive Large Language Models (LLMs). Traditional evaluation methods that rely on passage-level judgments are no longer effective, because LLM-based systems generate responses too diverse to match against a fixed pool of judged passages. We provide a workbench for exploring several alternative, LLM-based approaches to judging the relevance of a system's response: 1. Asking an LLM whether the response is relevant; 2. Asking the LLM which nuggets (i.e., relevant key facts) from a given set are covered in the response; 3. Asking the LLM to answer a set of exam questions using the response. The workbench aims to facilitate the development of new, reusable test collections. Researchers can manually refine the sets of nuggets and exam questions and observe the impact on system evaluation and leaderboard rankings. Resource available at https://github.com/TREMA-UNH/autograding-workbench
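As a rough illustration of the nugget- and exam-based approaches, the following minimal sketch scores each system by the fraction of test items its response covers and ranks systems by that score. It is not the workbench's actual API: the names `Judge`, `keyword_judge`, `coverage_score`, and `leaderboard` are hypothetical, the example data is invented, and the substring-matching judge is only a stand-in for the LLM call that would decide whether a nugget is covered or an exam question is answerable.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical judge signature: (test_item, response) -> True if the response
# covers the nugget or answers the exam question. Assumed to wrap an LLM call
# in a real setup.
Judge = Callable[[str, str], bool]


def keyword_judge(item: str, response: str) -> bool:
    """Toy stand-in for an LLM judge: case-insensitive substring match."""
    return item.lower() in response.lower()


def coverage_score(items: List[str], response: str, judge: Judge) -> float:
    """Fraction of test items (nuggets or exam questions) the response covers."""
    if not items:
        return 0.0
    return sum(judge(item, response) for item in items) / len(items)


def leaderboard(
    items: List[str], responses: Dict[str, str], judge: Judge
) -> List[Tuple[str, float]]:
    """Rank systems by coverage score, best first."""
    scores = {
        system: coverage_score(items, response, judge)
        for system, response in responses.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented example data, for illustration only.
nuggets = ["photosynthesis", "chlorophyll", "sunlight"]
responses = {
    "system_a": "Plants use sunlight and chlorophyll for photosynthesis.",
    "system_b": "Plants need sunlight to grow.",
}

ranking = leaderboard(nuggets, responses, keyword_judge)

# Manually refining the nugget set changes the scores, and possibly the ranking.
refined_ranking = leaderboard(["photosynthesis", "sunlight"], responses, keyword_judge)
```

Because the judge is passed in as a function, the same scoring loop could in principle serve both nugget coverage and exam-question answerability, and re-running `leaderboard` after editing the item list shows how a refinement shifts the scores.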