Efficient knowledge management plays a pivotal role in improving both the operational efficiency and the innovative capacity of businesses and organizations. Indexing knowledge through vectorization has given rise to a variety of knowledge retrieval methods that significantly enhance the efficacy of knowledge management systems. Recently, rapid advances in generative natural language processing have paved the way for generating precise and coherent answers from the relevant documents retrieved for a user query. However, for enterprise knowledge bases, assembling extensive training data from scratch for knowledge retrieval and generation is a formidable challenge because of the privacy and security policies governing private data, and it frequently entails substantial costs. To address this challenge, we propose EKRG, a novel Retrieval-Generation framework based on large language models (LLMs), designed to enable question answering over Enterprise Knowledge bases with limited annotation costs. Specifically, for the retrieval process, we first introduce an instruction-tuning method that uses an LLM to generate sufficient document-question pairs for training a knowledge retriever. Through carefully designed instructions, this method efficiently generates diverse questions for enterprise knowledge bases, covering both fact-oriented and solution-oriented knowledge. Additionally, we develop a relevance-aware teacher-student learning strategy to further enhance the efficiency of the training process. For the generation process, we propose a novel chain-of-thought (CoT) based fine-tuning method that enables the LLM-based generator to answer user questions using the retrieved documents. Finally, extensive experiments on real-world datasets demonstrate the effectiveness of the proposed framework.