This paper examines the assumption that Large Language Models (LLMs) skilled at generation tasks are equally adept as evaluators. We assess the performance of three LLMs and one open-source LM on Question-Answering (QA) and evaluation tasks using the TriviaQA dataset (Joshi et al., 2017). Results indicate a significant disparity, with LLMs performing worse on evaluation tasks than on generation tasks. Intriguingly, we find instances of unfaithful evaluation, in which models accurately evaluate answers in areas where they lack competence, underscoring the need to examine the faithfulness and trustworthiness of LLMs as evaluators. This study contributes to the understanding of "the Generative AI Paradox" (West et al., 2023), highlighting the need to explore the correlation between generative excellence and evaluation proficiency and to scrutinize faithfulness in model evaluations.