The effective training and evaluation of retrieval systems require a substantial number of relevance judgments, which are traditionally collected from human assessors -- a process that is both costly and time-consuming. Large Language Models (LLMs) have shown promise in generating relevance labels for search tasks, offering a potential alternative to manual assessments. Current approaches often rely on a single LLM, such as GPT-4, which, despite being effective, is expensive and prone to intra-model biases that can favour systems leveraging similar models. In this work, we introduce JudgeBlender, a framework that employs smaller, open-source models to provide relevance judgments by combining evaluations across multiple LLMs (LLMBlender) or multiple prompts (PromptBlender). By leveraging the LLMJudge benchmark [18], we compare JudgeBlender with state-of-the-art methods and the top performers in the LLMJudge challenge. Our results show that JudgeBlender achieves competitive performance, demonstrating that very large models are often unnecessary for reliable relevance assessments.
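The blending idea can be sketched minimally: collect per-judge relevance labels for a query-passage pair and aggregate them into a single label. This is an illustrative sketch only; the function name, label scale (0-3), and mean-then-round aggregation are assumptions, not the paper's actual JudgeBlender implementation.

```python
from statistics import mean

def blend_judgments(labels: list[int]) -> int:
    """Aggregate relevance labels (assumed 0-3 scale) from several
    small LLM judges by rounding their mean to one final label."""
    return round(mean(labels))

# Hypothetical labels from three small open-source judges
# for one query-passage pair.
print(blend_judgments([2, 3, 2]))  # 2
```

In practice, each label would come from prompting a different small model (LLMBlender) or the same model with a different prompt (PromptBlender) before aggregation.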