Language Models (LMs) memorize a vast amount of factual knowledge, exhibiting strong performance across diverse tasks and domains. However, it has been observed that their performance diminishes when dealing with less popular or low-frequency concepts and entities, for example in domain-specific applications. The two prominent approaches to enhance the performance of LMs on low-frequency topics are Retrieval Augmented Generation (RAG) and fine-tuning (FT) over synthetic data. This paper explores and evaluates the impact of RAG and FT on customizing LMs to handle low-frequency entities in question answering tasks. We conduct extensive experiments on twelve LMs of varying size and type, combined with different fine-tuning, data augmentation, and retrieval models. Our findings indicate that while FT boosts performance across entities of varying popularity, RAG surpasses FT by a large margin, particularly for the least popular factual knowledge. Additionally, the success of both RAG and FT is amplified by improving retrieval and data augmentation techniques. Fine-tuning, while beneficial for small LMs, requires extensive resources. To address this issue, we propose the novel Stimulus RAG approach, which surpasses the effectiveness of fine-tuning-based approaches and thereby eliminates the need for the costly data augmentation and fine-tuning steps for enriching LMs with less popular factual knowledge. The code is available at \url{https://github.com/informagi/RAGvsFT}.