Current information retrieval (IR) evaluation is based on relevance judgments, created either manually or automatically, with the automatic decisions outsourced to Large Language Models (LLMs). We offer an alternative paradigm that never relies on relevance judgments in any form. Instead, a text is defined as relevant if it contains information that enables the answering of key questions. We use this idea to design the EXAM Answerability Metric, which evaluates information retrieval/generation systems on their ability to provide topically relevant information. We envision the human judge's role as defining and editing an exam question bank that tests for the presence of relevant information in text. We support this step by generating an initial set of exam questions. In the next phase, an LLM-based question answering system automatically grades system responses by tracking which exam questions are answerable with which system responses. We propose two evaluation measures: the recall-oriented EXAM Cover metric and the precision-oriented EXAM Qrels metric, the latter of which can be implemented with trec_eval. This paradigm not only allows the exam question set to be expanded post hoc but also facilitates the ongoing evaluation of future information systems, whether they focus on retrieval, generation, or both.
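To make the two measures concrete, the following is a minimal sketch, not the authors' implementation. It assumes the LLM-based grading step has already produced, for each passage, the set of exam questions answerable from it. The names `exam_cover`, `exam_qrels`, and the sample data are hypothetical. Using the number of answerable questions as the relevance grade is also an assumption; the abstract only states that EXAM Qrels can be implemented with trec_eval, whose qrels files use the four-column layout `topic iteration docno relevance`.

```python
from typing import Dict, List, Set

# Hypothetical grading output: passage ID -> exam question IDs that an
# LLM-based QA system judged answerable from that passage.
Answerable = Dict[str, Set[str]]


def exam_cover(response: List[str], answerable: Answerable,
               question_bank: Set[str]) -> float:
    """Recall-oriented sketch: fraction of question-bank entries that are
    answerable from at least one passage in the system response."""
    if not question_bank:
        return 0.0
    covered: Set[str] = set()
    for passage_id in response:
        # Only count questions that belong to the current bank.
        covered |= answerable.get(passage_id, set()) & question_bank
    return len(covered) / len(question_bank)


def exam_qrels(topic_id: str, answerable: Answerable,
               question_bank: Set[str]) -> List[str]:
    """Precision-oriented sketch: emit one qrels line per graded passage.
    Assumption: the grade is the count of answerable bank questions."""
    lines = []
    for passage_id in sorted(answerable):
        grade = len(answerable[passage_id] & question_bank)
        lines.append(f"{topic_id} 0 {passage_id} {grade}")
    return lines


# Hypothetical sample data for one topic.
bank = {"q1", "q2", "q3", "q4"}
graded = {"p1": {"q1", "q2"}, "p2": {"q2"}, "p3": set()}

cover = exam_cover(["p1", "p2"], graded, bank)
qrels = exam_qrels("t1", graded, bank)
```

Because the question bank is an explicit input, adding questions later only requires regrading passages against the new questions and recomputing both measures, which is one way to read the post hoc expansion described above.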