This paper introduces StructTuning, a methodology for efficiently transforming foundation Large Language Models (LLMs) into domain specialists. It reduces the training-corpus requirement to a mere 0.3% while achieving 50% of the performance of traditional knowledge injection. Our approach is inspired by how human students learn: structured domain knowledge from textbooks is first assimilated and then applied to real-world challenges through targeted exercises. Accordingly, we propose a novel two-stage strategy for knowledge injection and alignment: Structure-aware Continual Pre-Training (SCPT) and Structure-aware Supervised Fine-Tuning (SSFT). In the SCPT stage, we automatically extract the domain knowledge taxonomy and reorganize the training corpora so that LLMs can effectively link textual segments to the targeted knowledge points within the taxonomy. In the SSFT stage, we explicitly prompt models to elucidate the underlying knowledge structure in their outputs, leveraging this structured domain insight to address practical problems. We evaluate our method extensively across model architectures and scales, using closed-book question-answering tasks on the LongBench and MMedBench datasets. Remarkably, it achieves improvement comparable to the state-of-the-art MMedLM2 on MMedBench while reducing training costs to 5%. This result paves the way for scaling up StructTuning toward stronger domain-specific LLMs with comprehensive data utilization. Code is available at https://github.com/alibaba/struxgpt.
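To make the two-stage strategy concrete, here is a minimal, hypothetical sketch of how SCPT-style training samples might link corpus segments to taxonomy knowledge points, and how SSFT-style prompts might ask the model to surface that structure before answering. This is not the paper's implementation; all function names, the prompt wording, and the toy taxonomy are illustrative assumptions.

```python
# Illustrative sketch only: the function names, prompt templates, and
# the toy medical taxonomy below are hypothetical, not from StructTuning.

def make_scpt_sample(taxonomy_path, segment):
    """SCPT-style sample: prepend the knowledge-taxonomy path so the
    model learns to associate a text segment with its knowledge point."""
    path = " > ".join(taxonomy_path)
    return f"[Knowledge point: {path}]\n{segment}"

def make_ssft_prompt(question, example_path):
    """SSFT-style prompt: ask the model to state the relevant knowledge
    structure explicitly before using it to answer."""
    path = " > ".join(example_path)
    return (f"Question: {question}\n"
            f"First identify the relevant knowledge point "
            f"(e.g., {path}), then answer using that structure.")

sample = make_scpt_sample(
    ["Cardiology", "Arrhythmia", "Atrial fibrillation"],
    "Atrial fibrillation is characterized by irregular atrial activity...")
print(sample.splitlines()[0])
# → [Knowledge point: Cardiology > Arrhythmia > Atrial fibrillation]
```

In this sketch, the taxonomy path acts as a structural anchor during continual pre-training, while the fine-tuning prompt trains the model to make that structure explicit at inference time.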