High-quality relevance judgements over large query sets are essential for evaluating Information Retrieval (IR) systems, yet manual annotation remains costly and time-consuming. Large Language Models (LLMs) have recently shown promise as automatic relevance assessors, but their reliability is still limited. Most existing approaches rely on zero-shot prompting or In-Context Learning (ICL) with a small number of labeled examples. However, standard ICL treats examples as independent instances and fails to explicitly capture the underlying relevance criteria of a topic, which restricts its ability to generalize to unseen query-document pairs. To address this limitation, we introduce Relevance Context Learning (RCL), a novel framework that leverages human relevance judgements to explicitly model topic-specific relevance criteria. Rather than using labeled examples directly for in-context prediction, RCL first prompts an LLM (the Instructor LLM) to analyze sets of judged query-document pairs and generate explicit narratives that describe what constitutes relevance for a given topic. These relevance narratives are then used as structured prompts to guide a second LLM (the Assessor LLM) in producing relevance judgements. To evaluate RCL in a realistic data-collection setting, we propose a hybrid pooling strategy in which a shallow depth-\textit{k} pool from participating systems is judged by human assessors, while the remaining documents are labeled by LLMs. Experimental results demonstrate that RCL substantially outperforms zero-shot prompting and consistently improves over standard ICL. Overall, our findings indicate that transforming relevance examples into explicit, context-aware relevance narratives is a more effective way of exploiting human judgements for LLM-based IR dataset construction.
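To make the two-stage pipeline concrete, below is a minimal sketch of RCL, assuming an OpenAI-style chat-completion client. The prompt wording, function names, and model choice are illustrative assumptions, not the authors' exact implementation.

```python
# A minimal sketch of the two-stage RCL pipeline, assuming the OpenAI
# Python client (openai>=1.0). Prompt wording, function names, and the
# model choice are illustrative assumptions, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_relevance_narrative(query: str,
                                 judged_pairs: list[tuple[str, int]]) -> str:
    """Instructor LLM: distill human-judged (document, label) pairs into an
    explicit narrative of what constitutes relevance for this topic."""
    examples = "\n\n".join(
        f"Document: {doc}\nHuman label: {label}" for doc, label in judged_pairs
    )
    prompt = (
        f"Query: {query}\n\n"
        f"Below are documents with human relevance labels "
        f"(0 = not relevant, 1 = relevant):\n\n{examples}\n\n"
        "Based on these judgements, write a concise narrative describing "
        "what makes a document relevant to this topic."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def assess_relevance(query: str, narrative: str, document: str) -> int:
    """Assessor LLM: judge an unseen document, guided by the narrative
    produced by the Instructor LLM."""
    prompt = (
        f"Query: {query}\n\n"
        f"Relevance criteria for this topic:\n{narrative}\n\n"
        f"Document: {document}\n\n"
        "Answer with a single digit: 1 if the document is relevant, 0 if not."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return int(resp.choices[0].message.content.strip()[0])
```

In the hybrid pooling setting, only documents outside the human-judged depth-\textit{k} pool would be routed through `assess_relevance`; the shallow pool itself supplies the judged pairs consumed by the Instructor LLM.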