This paper evaluates the extent to which current Large Language Models (LLMs) can capture task-oriented multi-party conversations (MPCs). We recorded and transcribed 29 MPCs between patients, their companions, and a social robot in a hospital, and then annotated this corpus for multi-party goal tracking and intent-slot recognition. In MPCs, people share goals, answer each other's goals, and provide other people's goals; none of these behaviours occur in dyadic interactions. To understand user goals in MPCs, we compared three methods in zero-shot and few-shot settings: we fine-tuned T5, created pre-training tasks to train DialogLM using LED, and employed prompt engineering techniques with GPT-3.5-turbo. Our aim was to determine which approach can complete this novel task with limited data. GPT-3.5-turbo significantly outperformed the others in a few-shot setting. The 'reasoning' style prompt, when given 7% of the corpus as example annotated conversations, was the best-performing method. It correctly annotated 62.32% of the goal-tracking MPCs and 69.57% of the intent-slot recognition MPCs. A 'story' style prompt increased model hallucination, which could be detrimental if deployed in safety-critical settings. We conclude that multi-party conversations still challenge state-of-the-art LLMs.