TCM-GPT: Efficient Pre-training of Large Language Models for Domain Adaptation in Traditional Chinese Medicine

Pre-training and fine-tuning have emerged as a promising paradigm across various natural language processing (NLP) tasks. Pretrained large language models (LLMs) have continued to improve and hold potential for applications in medicine, particularly in Traditional Chinese Medicine (TCM). However, applying these general models to specific domains often yields suboptimal results, primarily because of a lack of domain knowledge, domain-unique objectives, and limited computational efficiency. Furthermore, their effectiveness in specialized domains such as TCM requires comprehensive evaluation. To address these issues, we propose TCMDA (TCM Domain Adaptation), a novel approach for efficient pre-training on a domain-specific corpus. Specifically, we first construct a large TCM-specific corpus, TCM-Corpus-1B, by identifying domain keywords and retrieving matching documents from a general corpus. TCMDA then leverages LoRA, which freezes the pretrained model's weights and uses rank decomposition matrices to efficiently train specific dense layers, for both pre-training and fine-tuning. This efficiently aligns the model with TCM-related tasks and yields TCM-GPT-7B. We further conducted extensive experiments on two TCM tasks, TCM examination and TCM diagnosis. TCM-GPT-7B achieved the best performance on both datasets, outperforming other models by relative increments of 17% and 12% in accuracy, respectively. To the best of our knowledge, our study is the first validation of domain adaptation of a large language model with 7 billion parameters in the TCM domain. We will release both TCM-Corpus-1B and the TCM-GPT-7B model once the paper is accepted, to facilitate interdisciplinary development in TCM and NLP and to serve as a foundation for further study.
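
The corpus construction step (identify domain keywords, then retrieve matching text from a general corpus) can be illustrated with a minimal sketch. The keyword list, the `min_hits` threshold, and the function names below are hypothetical and chosen only for illustration; the abstract does not specify how keywords were selected or matched.

```python
from typing import Iterable, Iterator, Sequence

# Hypothetical seed keywords; the actual keyword list is not given in the abstract.
TCM_KEYWORDS: Sequence[str] = ("中医", "针灸", "方剂", "经络", "acupuncture", "herbal formula")


def keyword_hits(text: str, keywords: Iterable[str] = TCM_KEYWORDS) -> int:
    """Count how many distinct keywords occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for kw in keywords if kw.lower() in lowered)


def retrieve_domain_docs(
    corpus: Iterable[str],
    keywords: Iterable[str] = TCM_KEYWORDS,
    min_hits: int = 2,  # assumed threshold, not taken from the paper
) -> Iterator[str]:
    """Yield documents from a general corpus that mention enough domain keywords."""
    keywords = tuple(keywords)
    for doc in corpus:
        if keyword_hits(doc, keywords) >= min_hits:
            yield doc


if __name__ == "__main__":
    general_corpus = [
        "针灸与经络理论是中医的基础。",
        "The stock market closed higher today.",
        "Acupuncture is sometimes combined with a herbal formula.",
    ]
    for doc in retrieve_domain_docs(general_corpus):
        print(doc)
```

Requiring more than one keyword hit is one simple way to reduce false positives from passing mentions; a real pipeline would likely need deduplication and quality filtering as well.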

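
The LoRA mechanism described in the abstract (frozen pretrained weights plus trainable rank decomposition matrices) can be sketched in a few lines of pure Python. This is a minimal illustration of the general LoRA idea, not the authors' training code: the matrix shapes, the `scale` factor, and all function names are assumptions made for the example.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(B, A, scale):
    """Low-rank update scale * (B @ A), with B of shape d_out x r and A of shape r x d_in."""
    return [[scale * v for v in row] for row in matmul(B, A)]


def lora_forward(x, W, A, B, scale):
    """Compute x @ (W + scale * B @ A)^T for a batch of row vectors x.

    W is the frozen pretrained weight (d_out x d_in); only A and B would be trained.
    """
    delta = lora_delta(B, A, scale)
    W_eff = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]
    W_eff_T = [list(col) for col in zip(*W_eff)]
    return matmul(x, W_eff_T)


def lora_param_count(d_out, d_in, rank):
    """Number of trainable parameters in A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    W = [[1.0, 2.0], [3.0, 4.0]]   # frozen weight, never updated
    A = [[1.0, 0.0]]               # rank-1 down-projection
    B = [[0.0], [0.0]]             # zero-initialised, so the update starts at zero
    x = [[1.0, 1.0]]
    print(lora_forward(x, W, A, B, scale=2.0))  # identical to the frozen layer's output
    print(lora_param_count(4096, 4096, 8), "trainable vs", 4096 * 4096, "frozen")
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training only moves the low-rank term. For an illustrative 4096 x 4096 layer with rank 8 (sizes assumed here, not stated in the abstract), the trainable parameters number 65,536 against 16,777,216 frozen ones, which is the source of the efficiency gain the abstract refers to.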