Using Large Language Models (LLMs) to evaluate LLM outputs offers a promising method for assessing model performance across various contexts. Previous research indicates that LLM-as-a-judge exhibits a strong correlation with human judges in the context of general instruction following. However, for instructions that require specialized knowledge, the validity of using LLMs as judges remains uncertain. In our study, we applied a mixed-methods approach, conducting pairwise comparisons in which both subject matter experts (SMEs) and LLMs evaluated outputs from domain-specific tasks. We focused on two distinct fields: dietetics, with registered dietitian experts, and mental health, with clinical psychologist experts. Our results showed that SMEs agreed with LLM judges 68% of the time in the dietetics domain and 64% in the mental health domain when evaluating overall preference. Additionally, the results indicated variations in SME-LLM agreement across domain-specific aspect questions. Our findings emphasize the importance of keeping human experts in the evaluation process, as LLMs alone may not provide the depth of understanding required for complex, knowledge-specific tasks. We also explore the implications of LLM evaluations across different domains and discuss how these insights can inform the design of evaluation workflows that ensure better alignment between human experts and LLMs in interactive systems.
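The overall-preference agreement figures above (68% and 64%) reflect how often the expert and the LLM judge picked the same output in a pairwise comparison. A minimal sketch of that metric, using hypothetical judgment data (the paper's actual evaluation pipeline is not shown here):

```python
def agreement_rate(sme_prefs, llm_prefs):
    """Fraction of pairwise comparisons where the SME and the LLM
    judge preferred the same output ('A' or 'B')."""
    assert len(sme_prefs) == len(llm_prefs)
    matches = sum(s == l for s, l in zip(sme_prefs, llm_prefs))
    return matches / len(sme_prefs)

# Hypothetical judgments over 10 pairwise comparisons in one domain.
sme = ["A", "B", "A", "A", "B", "A", "B", "B", "A", "A"]
llm = ["A", "B", "B", "A", "B", "A", "A", "B", "A", "B"]
print(f"SME-LLM agreement: {agreement_rate(sme, llm):.0%}")  # → 70%
```

The same computation can be repeated per aspect question to surface the per-aspect variation the abstract mentions; chance-corrected statistics such as Cohen's kappa are a common complement to raw agreement.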