As dialogue systems and chatbots increasingly integrate into everyday interactions, the need for efficient and accurate evaluation methods becomes paramount. This study explores the comparative performance of human and AI assessments across a range of dialogue scenarios, focusing on seven key performance indicators (KPIs): Coherence, Innovation, Concreteness, Goal Contribution, Commonsense Contradiction, Incorrect Fact, and Redundancy. Utilizing the GPT-4o API, we generated a diverse dataset of conversations and conducted a two-part experimental analysis. In Experiment 1, we evaluated multi-party conversations on Coherence, Innovation, Concreteness, and Goal Contribution, revealing that GPT models align closely with human judgments. Notably, both human and AI evaluators exhibited a tendency towards binary judgment rather than linear scaling, highlighting a shared challenge in these assessments. Experiment 2 extended the work of Finch et al. (2023) by focusing on dyadic dialogues and assessing Commonsense Contradiction, Incorrect Fact, and Redundancy. The results indicate that while GPT-4o demonstrates strong performance in maintaining factual accuracy and commonsense reasoning, it still struggles with reducing redundancy and self-contradiction. Our findings underscore the potential of GPT models to closely replicate human evaluation in dialogue systems, while also pointing to areas for improvement. This research offers valuable insights for advancing the development and implementation of more refined dialogue evaluation methodologies, contributing to the evolution of more effective and human-like AI communication tools.