The recent emergence of large language models (LLMs) has attracted considerable attention. LLMs may interact with users in the form of dialogue and generate responses following their instructions, which naturally requires dialogue comprehension abilities. Without correct comprehension of the dialogue, a model will inevitably generate incorrect responses. However, dialogue comprehension is a general language ability that is hard to evaluate directly. In this work, we propose to perform the evaluation with the help of the dialogue summarization task. Besides evaluating and analyzing dialogue summarization performance (DIAC-Sum), we also derive factual questions from the generated summaries and use them as a more flexible measurement of dialogue comprehension (DIAC-FactQA). Our evaluation shows that, on average, 27% of the summaries generated by LLMs contain factual inconsistencies. Even ChatGPT, the strongest evaluated model, has such errors in 16% of its summaries. On the more challenging task of answering the factual questions, the average accuracy across all evaluated LLMs is only 62.8%. Both results indicate serious deficiencies. Detailed analysis shows that understanding the subject and object of the conversation remains the most challenging problem for LLMs. Furthermore, to stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data. The experimental results demonstrate that our method achieves an accuracy improvement of 8.9% on DIAC-FactQA.
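The two headline numbers above (the share of summaries with factual inconsistencies and the accuracy on factual questions) are simple ratios. The following is a minimal sketch of how such ratios could be computed, not the evaluation code used in this work. The record types `SummaryJudgment` and `FactQAItem`, their fields, and the exact-match comparison of answers are all assumptions made for illustration; the toy data is invented and does not reproduce any reported result.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SummaryJudgment:
    """Hypothetical record: one generated summary and its consistency label."""
    model: str
    has_factual_error: bool  # assumed to come from human or automatic annotation


@dataclass
class FactQAItem:
    """Hypothetical record: one factual question with predicted and gold answers."""
    model: str
    predicted: str
    gold: str


def inconsistency_rate(judgments: Sequence[SummaryJudgment]) -> float:
    """Fraction of summaries labeled as containing a factual inconsistency."""
    if not judgments:
        raise ValueError("no judgments given")
    return sum(j.has_factual_error for j in judgments) / len(judgments)


def factqa_accuracy(items: Sequence[FactQAItem]) -> float:
    """Fraction of questions answered correctly.

    Assumption: answers are compared by case-insensitive exact match.
    """
    if not items:
        raise ValueError("no items given")
    correct = sum(
        i.predicted.strip().lower() == i.gold.strip().lower() for i in items
    )
    return correct / len(items)


# Toy data, invented for illustration only.
judgments = [
    SummaryJudgment("model-a", True),
    SummaryJudgment("model-a", False),
    SummaryJudgment("model-a", False),
    SummaryJudgment("model-a", True),
]
items = [
    FactQAItem("model-a", "Yes", "yes"),
    FactQAItem("model-a", "no", "yes"),
]

rate = inconsistency_rate(judgments)
acc = factqa_accuracy(items)
print(f"inconsistency rate: {rate:.1%}, FactQA accuracy: {acc:.1%}")
```

Averaging such per-model ratios over all evaluated models would yield figures of the kind reported above, assuming each model is weighted equally.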