Continued pre-training on a domain-specific corpus is a prevalent method for enhancing the domain-specific capabilities of large language models (LLMs). Recent work demonstrates that adapting models using reading comprehension data formatted by regex-based patterns can significantly improve performance on domain-specific tasks. However, regex-based patterns cannot draw on domain-specific knowledge when parsing raw corpora. Furthermore, question-answer pairs extracted directly from the corpus in predefined formats offer limited context. To address these limitations, we improve reading comprehension using an LLM and clustering. The LLM leverages domain knowledge within the corpus to refine the comprehension stage, while clustering supplies relevant knowledge by extending the context to enrich the reading stage. Additionally, our method incorporates parameter-efficient fine-tuning to improve the efficiency of domain adaptation. Compared with AdaptLLM, our method achieves an improvement exceeding 5% on domain-specific tasks. Our code will be available at https://github.com/microsoft/LMOps.