We study the limitations of Large Language Models (LLMs) for the task of response generation in human-machine dialogue. Several techniques have been proposed in the literature for different dialogue types (e.g., Open-Domain). However, evaluations of these techniques have been limited in terms of base LLMs, dialogue types, and evaluation metrics. In this work, we extensively analyze different LLM adaptation techniques applied to different dialogue types. We select two base LLMs, Llama-2 and Mistral, and four dialogue types: Open-Domain, Knowledge-Grounded, Task-Oriented, and Question Answering. We evaluate the performance of in-context learning and fine-tuning techniques across datasets selected for each dialogue type. We assess the impact of incorporating external knowledge to ground the generation in both the Retrieval-Augmented Generation (RAG) and gold-knowledge scenarios. We adopt consistent evaluation and explainability criteria for both automatic metrics and human evaluation protocols. Our analysis shows that there is no universal best technique for adapting large language models: the efficacy of each technique depends on both the base LLM and the specific type of dialogue. Finally, the assessment of the best adaptation technique should include human evaluation to avoid the false expectations and conclusions that automatic metrics alone can produce.