Large Language Models (LLMs) have made progress on a variety of real-world tasks, which creates a need for methods to evaluate them. Existing LLM evaluation methods are mainly based on supervised signals: they depend on static datasets and cannot assess the ability of LLMs in dynamic real-world scenarios, where deep interaction is widespread. Other evaluation methods are human-based; they are costly and time-consuming, and are not feasible for large-scale evaluation. To address these issues, we propose a novel deep interaction-based LLM evaluation framework. In this framework, the performance of LLMs in real-world domains is evaluated through their deep interaction with other LLMs in carefully designed evaluation tasks. Furthermore, the framework is a general evaluation method that can be applied to a range of real-world tasks, such as machine translation and code generation. We demonstrate the effectiveness of the method through extensive experiments on four carefully designed evaluation tasks.