Once powerful conversational models became available to a wide audience, users began actively engaging in social interactions with this technology. Such unprecedented interactions may pose considerable social and psychological risks to users unless the technology is properly controlled. This creates an urgent need for scalable and robust evaluation metrics for conversational chatbots. Existing automatic evaluation metrics usually focus on objective quality measures and disregard subjective perceptions of social dimensions. Moreover, most of these approaches operate on pre-produced dialogs from available benchmark corpora, which requires human involvement in preparing the evaluation material and thus impedes the scalability of the metrics. To address this limitation, we propose to use emerging large language models (LLMs) from the GPT family and describe a new framework for evaluating dialog systems via prompting. With this framework, we achieve full automation of the evaluation pipeline and reach a strong correlation with human judgment (up to Pearson r=0.95 at the system level). The underlying concept is to collect synthetic chat logs of the evaluated bots with an LLM in the other-play setting, where the LLM is carefully conditioned to follow a specific scenario. We further explore different prompting approaches to produce evaluation scores with the same LLM. The best-performing prompts, which contain few-shot demonstrations and instructions, show outstanding performance on the tested dataset and demonstrate the ability to generalize to other dialog corpora.
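To make the reported "system level" correlation concrete, the following is a minimal sketch of how such a figure could be computed. It assumes that system-level agreement means averaging per-dialog scores into one score per bot and then taking the Pearson correlation between the LLM-derived and human-derived averages; the abstract does not spell out this procedure. The function names and the `(system_name, score)` input format are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def system_level_scores(dialog_scores):
    """Average per-dialog scores into one score per system.

    dialog_scores: iterable of (system_name, score) pairs (assumed format).
    """
    by_system = defaultdict(list)
    for system, score in dialog_scores:
        by_system[system].append(score)
    return {system: mean(scores) for system, scores in by_system.items()}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("Pearson r is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def system_level_correlation(llm_dialog_scores, human_dialog_scores):
    """Correlate per-system mean scores from the LLM and from human raters."""
    llm = system_level_scores(llm_dialog_scores)
    human = system_level_scores(human_dialog_scores)
    # Only systems scored by both sources can be compared.
    systems = sorted(llm.keys() & human.keys())
    return pearson_r([llm[s] for s in systems], [human[s] for s in systems])
```

A call such as `system_level_correlation(llm_pairs, human_pairs)` returns a single value in the range [-1, 1]. Because only one data point per bot enters the correlation, the result depends strongly on how many systems are compared.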