LLMs may interact with users through dialogue and generate responses that follow their instructions, which naturally requires dialogue comprehension abilities. However, dialogue comprehension is a general language ability that is hard to evaluate directly. In this work, we propose to evaluate it with the help of the dialogue summarization task. Besides evaluating and analyzing the dialogue summarization performance (DIAC-Sum) of different LLMs, we also derive factual questions from the generated summaries and use them as a more flexible measurement of dialogue comprehension (DIAC-FactQA). Our evaluation shows that, on average, 27% of the summaries generated by LLMs contain factual inconsistencies. Even ChatGPT, the strongest model evaluated, has such errors in 16% of its summaries. On the more challenging task of answering the factual questions, the average error rate of all evaluated LLMs is 37.2%. Both results indicate serious deficiencies. Detailed analysis shows that understanding the subject/object of the conversation remains the most challenging problem for LLMs. Furthermore, to stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data. The experimental results demonstrate that our method achieves an error rate improvement of 10.9% on DIAC-FactQA.