We study the limitations of Large Language Models (LLMs) for the task of response generation in human-machine dialogue. Several techniques have been proposed in the literature for different dialogue types (e.g., Open-Domain). However, the evaluations of these techniques have been limited in terms of base LLMs, dialogue types, and evaluation metrics. In this work, we extensively analyze different LLM adaptation techniques applied to different dialogue types. We select two base LLMs, Llama-2 and Mistral, and four dialogue types: Open-Domain, Knowledge-Grounded, Task-Oriented, and Question Answering. We evaluate the performance of in-context learning and fine-tuning techniques across datasets selected for each dialogue type. We assess the impact of incorporating external knowledge to ground the generation in two scenarios: Retrieval-Augmented Generation (RAG) and gold knowledge. We adopt consistent evaluation and explainability criteria for automatic metrics and human evaluation protocols. Our analysis shows that there is no universal best technique for adapting large language models, as the efficacy of each technique depends on both the base LLM and the specific type of dialogue. Finally, the assessment of the best adaptation technique should include human evaluation to avoid misleading expectations and conclusions derived solely from automatic metrics.