Developing high-performing dialogue systems benefits from the automatic identification of undesirable behaviors in system responses. However, detecting such behaviors remains challenging because it draws on broad general knowledge and an understanding of conversational practices. Although recent research has focused on building specialized classifiers for detecting specific dialogue behaviors, their behavior coverage is still incomplete, and testing on real-world human-bot interactions is lacking. This paper investigates the ability of a state-of-the-art large language model (LLM), ChatGPT-3.5, to perform dialogue behavior detection for nine categories in real human-bot dialogues. We aim to assess whether ChatGPT can match specialized models and approximate human performance, thereby reducing the cost of behavior detection tasks. Our findings reveal that neither specialized models nor ChatGPT has yet achieved satisfactory results for this task, and both fall short of human performance. Nevertheless, ChatGPT shows promise and often outperforms specialized detection models. We conclude with an in-depth examination of the prevalent shortcomings of ChatGPT, offering guidance for future research to enhance LLM capabilities.