Rapid advances in large language models (LLMs) have made evaluating these models increasingly challenging. Existing evaluation methods are either reference-based or preference-based, so they inevitably require human intervention or introduce test bias from the evaluator models. In this paper, we propose GameEval, a novel approach that evaluates LLMs through goal-driven conversational games and overcomes the limitations of previous methods. GameEval treats LLMs as game players and assigns them distinct roles with specific goals, which they pursue through conversations of various forms, including discussion, question answering, and voting. We design three unique games with cooperative or adversarial objectives, each accompanied by corresponding evaluation metrics, to show how this new paradigm comprehensively evaluates model performance. Through extensive experiments, we show that GameEval can effectively differentiate the capabilities of various LLMs, providing a comprehensive assessment of their integrated abilities to solve complex problems. Our public anonymous code is available at https://github.com/GameEval/GameEval.