While large language models (LLMs) like GPT-4 have recently demonstrated strong zero-shot capabilities on general-domain tasks, they often hallucinate in specific domains such as Chinese law, which hinders their application in these areas. This is typically because the training data does not cover the specific domain, so GPT-4 cannot acquire in-domain knowledge. A pressing challenge is that it is not feasible to continue training LLMs of such scale on in-domain data. This paper introduces a simple and effective domain adaptation framework for GPT-4 by reformulating generation as an \textbf{adapt-retrieve-revise} process. The first step is to \textbf{adapt} an affordable 7B LLM to the target domain by continuing its training on in-domain data. When solving a task, we use the adapted LLM to generate a draft answer for the task query. The draft answer is then used to \textbf{retrieve} supporting evidence candidates from an external in-domain knowledge base. Finally, the draft answer and the retrieved evidence are concatenated into a single prompt that lets GPT-4 assess the evidence and \textbf{revise} the draft answer to produce the final answer. Our proposal combines the efficiency of adapting a smaller 7B model with the evidence-assessing capability of GPT-4, and effectively prevents GPT-4 from generating hallucinatory content. In the zero-shot setting of four Chinese legal tasks, our method improves accuracy by 33.3\% compared to direct generation by GPT-4. Compared to two stronger retrieval-based baselines, our method outperforms them by 15.4\% and 23.9\%, respectively. Our code will be released.
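The adapt-retrieve-revise flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the prompt wording, and the word-overlap retriever are all assumptions made for the example, and the two models are passed in as plain callables standing in for the adapted 7B LLM and GPT-4.

```python
from typing import Callable, List


def retrieve(draft: str, knowledge_base: List[str], top_k: int = 2) -> List[str]:
    """Toy retriever (assumption): rank entries by word overlap with the draft."""
    draft_words = set(draft.lower().split())
    ranked = sorted(
        knowledge_base,
        key=lambda entry: len(draft_words & set(entry.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_revision_prompt(query: str, draft: str, evidence: List[str]) -> str:
    """Concatenate query, draft answer, and evidence into one prompt.

    The prompt wording is hypothetical; the source does not specify it.
    """
    evidence_block = "\n".join(f"[{i + 1}] {e}" for i, e in enumerate(evidence))
    return (
        f"Question: {query}\n"
        f"Draft answer: {draft}\n"
        f"Evidence:\n{evidence_block}\n"
        "Assess the evidence and revise the draft answer."
    )


def adapt_retrieve_revise(
    query: str,
    adapted_llm: Callable[[str], str],  # stand-in for the domain-adapted 7B model
    reviser_llm: Callable[[str], str],  # stand-in for GPT-4
    knowledge_base: List[str],
    top_k: int = 2,
) -> str:
    draft = adapted_llm(query)                        # draft from the adapted model
    evidence = retrieve(draft, knowledge_base, top_k)  # retrieve using the draft
    prompt = build_revision_prompt(query, draft, evidence)
    return reviser_llm(prompt)                        # revise with the larger model


if __name__ == "__main__":
    # Placeholder knowledge base and stub models, for illustration only.
    kb = ["statute A: theft of property", "statute B: breach of contract"]
    answer = adapt_retrieve_revise(
        "Which statute applies?",
        adapted_llm=lambda q: "this is theft of property under statute A",
        reviser_llm=lambda prompt: prompt,  # echo the prompt instead of calling a model
        knowledge_base=kb,
        top_k=1,
    )
    print(answer)
```

Note that the sketch queries the knowledge base with the draft answer rather than with the original task query, which is the distinguishing step the abstract describes. A real system would replace the overlap scorer with a proper retriever.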