CooK: Empowering General-Purpose Language Models with Modular and Collaborative Knowledge

Large language models (LLMs) are increasingly adopted for knowledge-intensive tasks and contexts. Existing approaches improve the knowledge capabilities of general-purpose LLMs through retrieval or generated knowledge prompting, but they fall short of reflecting two key properties of knowledge-rich models: knowledge should be modular, ever-growing, and sourced from diverse domains; and knowledge acquisition and production should be a collaborative process in which diverse stakeholders contribute new information. To this end, we propose CooK, a novel framework to empower general-purpose LLMs with modular and collaboratively sourced knowledge. We first introduce specialized language models (LMs), autoregressive models trained on corpora from a wide range of domains and sources. These specialized LMs serve as parametric knowledge repositories that are later prompted to generate background knowledge for general-purpose LLMs. We then propose three knowledge filters that dynamically select and retain information in the generated documents by controlling for relevance, brevity, and factuality. Finally, we propose bottom-up and top-down knowledge integration approaches that augment general-purpose LLMs with the curated (relevant, factual) knowledge from community-driven specialized LMs; these approaches enable multi-domain knowledge synthesis and on-demand knowledge requests. Through extensive experiments, we demonstrate that CooK achieves state-of-the-art performance on six benchmark datasets. Our results highlight the potential of enriching general-purpose LLMs with evolving and modular knowledge: relevant knowledge that can be continuously updated through the collective efforts of the research community.
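The pipeline described above (specialized LMs generate documents, filters curate them, and the result is handed to a general-purpose LLM) can be illustrated with a minimal Python sketch of one possible bottom-up flow. Everything in it is an assumption made for illustration: the abstract does not say how the filters are implemented, so `relevance` is a toy word-overlap score, `shorten` is plain truncation, `is_factual` is a placeholder, and the specialized LMs are stubbed as ordinary functions. The names `bottom_up_prompt`, `lms`, and `top_k` are hypothetical and not taken from CooK. The top-down approach is not sketched.

```python
import re
from typing import Callable, Dict

# Hypothetical stand-in: a "specialized LM" is any callable that maps a
# query string to a generated background document.
SpecializedLM = Callable[[str], str]


def _words(text: str) -> set:
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def relevance(query: str, doc: str) -> float:
    """Toy relevance filter: fraction of query words found in the document."""
    q = _words(query)
    return len(q & _words(doc)) / len(q) if q else 0.0


def shorten(doc: str, max_words: int = 30) -> str:
    """Toy brevity filter: keep only the first max_words words."""
    return " ".join(doc.split()[:max_words])


def is_factual(doc: str) -> bool:
    """Placeholder factuality filter: accepts any non-empty document.
    A real filter would need an actual fact-checking signal."""
    return bool(doc.strip())


def bottom_up_prompt(query: str, lms: Dict[str, SpecializedLM], top_k: int = 2) -> str:
    """Query every specialized LM, filter the outputs, and prepend the
    surviving documents to the question for a general-purpose LLM."""
    docs = [lm(query) for lm in lms.values()]
    docs = [d for d in docs if is_factual(d)]
    docs.sort(key=lambda d: relevance(query, d), reverse=True)
    kept = [shorten(d) for d in docs[:top_k]]
    knowledge = "\n".join(f"Knowledge: {d}" for d in kept)
    return f"{knowledge}\nQuestion: {query}\nAnswer:"


# Stubbed specialized LMs (assumed outputs, not real model generations).
lms = {
    "biomed": lambda q: "Insulin is a hormone produced by the pancreas.",
    "news": lambda q: "The stock market closed higher on Friday.",
    "empty": lambda q: "",
}
query = "Which organ produces insulin?"
prompt = bottom_up_prompt(query, lms, top_k=1)
print(prompt)
```

In this sketch the returned string would be passed to whatever general-purpose LLM is in use. Because each specialized LM is just an entry in a dictionary, adding or replacing one leaves the rest of the flow unchanged, which is the modularity the abstract emphasizes.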
