Traditional reference-based metrics, such as BLEU and ROUGE, are less effective for assessing outputs from Large Language Models (LLMs) when the generated text is highly creative or exceeds the quality of the references, or when reference outputs are unavailable. Human evaluation remains an option, but it is costly and difficult to scale. Recent work on using LLMs as evaluators (LLM-as-a-judge) is promising, but trust and reliability remain significant concerns. Integrating human input is crucial to ensuring that evaluation criteria align with human intent and that evaluations are robust and consistent. This paper presents a user study of EvaluLLM, a design exploration that enables users to leverage LLMs as customizable judges, promoting human involvement to balance the cost-saving potential of automated evaluation with appropriate caution. Through interviews with eight domain experts, we identified a need for assistance in developing effective evaluation criteria that align the LLM-as-a-judge with practitioners' preferences and expectations. We offer findings and design recommendations for optimizing human-assisted LLM-as-judge systems.