While ongoing advances in Large Language Models (LLMs) have demonstrated remarkable success across various NLP tasks, Retrieval-Augmented Generation (RAG) models stand out as highly effective on downstream applications such as Question Answering (QA). Recently, the RAG-end2end model further optimized the architecture and achieved notable performance improvements in domain adaptation. However, the effectiveness of these RAG-based architectures remains relatively unexplored when they are fine-tuned on specialized domains such as customer service to build a reliable conversational AI system. Furthermore, a critical challenge persists in reducing the occurrence of hallucinations while maintaining high domain-specific accuracy. In this paper, we investigated the performance of diverse RAG and RAG-like architectures under domain adaptation and evaluated their ability to generate accurate and relevant responses grounded in a contextual knowledge base. To facilitate the evaluation, we constructed a novel dataset, HotelConvQA, sourced from a wide range of hotel-related conversations, and fine-tuned all the models on this domain-specific dataset. We also addressed a critical research gap by determining the impact of domain adaptation on hallucinations across different RAG architectures, an aspect that was not properly measured in prior work. Our evaluation shows that domain adaptation improves results on all metrics: it not only enhances the models' performance on QA tasks but also significantly reduces hallucination across all evaluated RAG architectures.