The effective training and evaluation of retrieval systems require a substantial number of relevance judgments, which are traditionally collected from human assessors -- a process that is both costly and time-consuming. Large Language Models (LLMs) have shown promise in generating relevance labels for search tasks, offering a potential alternative to manual assessments. Current approaches often rely on a single LLM, such as GPT-4, which, despite being effective, is expensive and prone to intra-model biases that can favour systems leveraging similar models. In this work, we introduce JudgeBlender, a framework that employs smaller, open-source models to provide relevance judgments by combining evaluations across multiple LLMs (LLMBlender) or multiple prompts (PromptBlender). By leveraging the LLMJudge benchmark [18], we compare JudgeBlender with state-of-the-art methods and the top performers in the LLMJudge challenge. Our results show that JudgeBlender achieves competitive performance, demonstrating that very large models are often unnecessary for reliable relevance assessments.
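The blending idea can be sketched minimally: collect per-judge relevance labels for a query-passage pair and aggregate them into a single label. This is an illustrative sketch only; the function name, label scale (0-3), and mean-then-round aggregation are assumptions, not the paper's actual JudgeBlender implementation.

```python
from statistics import mean

def blend_judgments(labels: list[int]) -> int:
    """Aggregate relevance labels (assumed 0-3 scale) from several
    small LLM judges by rounding their mean to one final label."""
    return round(mean(labels))

# Hypothetical labels from three small open-source judges
# for one query-passage pair.
print(blend_judgments([2, 3, 2]))  # 2
```

In practice, each label would come from prompting a different small model (LLMBlender) or the same model with a different prompt (PromptBlender) before aggregation.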