There are currently two main paradigms for evaluating large language models (LLMs): reference-based evaluation and preference-based evaluation. The first, carried over from the evaluation of machine learning models in general, relies on pre-defined task instances for which reference task executions are available. The second, best exemplified by the LM-arena, relies on (often self-selected) users bringing their own intents to a site that routes them to several models in parallel; from the models' responses, the user then selects their most preferred one. The former paradigm hence excels at control over what is tested, while the latter offers higher ecological validity, testing actual use cases interactively. Recently, a third, complementary paradigm has emerged that combines some of the strengths of these approaches, offering control over multi-turn, reference-free, repeatable interactions while stressing goal-directedness: dialogue-game-based evaluation. While the utility of this approach has been demonstrated by several projects, its adoption has been held back by the lack of a mature, easily reusable implementation. In this paper, we present clembench, which has been in continuous development since 2023 and has, in its latest release, been optimized for ease of general use. We describe how it can be used to benchmark one's own models (using a provided set of benchmark game instances in English), as well as how easily the benchmark itself can be extended with new, tailor-made, targeted tests.