Large language models (LLMs) have achieved remarkable breakthroughs in dialogue capabilities, reshaping people's impressions of dialogue systems. The long-standing goal of dialogue systems is to be human-like enough to establish long-term connections with users by satisfying their needs for communication, affection, and social belonging. There is therefore an urgent need to evaluate LLMs as human-like dialogue systems. In this paper, we propose DialogBench, a dialogue evaluation benchmark that currently contains $12$ dialogue tasks, each assessing a capability that LLMs should have as human-like dialogue systems. Specifically, we prompt GPT-4 to generate evaluation instances for each task. We first design a basic prompt based on widely used design principles, and then mitigate existing biases in it to generate higher-quality evaluation instances. Our extensive tests on $28$ LLMs (including pre-trained and supervised instruction-tuned models) show that instruction fine-tuning improves the human likeness of LLMs to a certain extent, but that most LLMs still have much room for improvement as human-like dialogue systems. The experimental results also indicate that LLMs perform differently across the various abilities that human-like dialogue systems should have. We will publicly release DialogBench, along with the associated evaluation code, for the broader research community.