Recent advances in large language models (LLMs) have shown promising results across a variety of natural language processing (NLP) tasks. The application of LLMs to specific domains, such as biomedicine, has attracted increasing attention. However, most biomedical LLMs focus on enhancing performance in monolingual biomedical question answering and conversation tasks. To further investigate the effectiveness of LLMs on diverse biomedical NLP tasks in different languages, we present Taiyi, a bilingual (English and Chinese) fine-tuned LLM for diverse biomedical tasks. In this work, we first curate a comprehensive collection of 140 existing biomedical text mining datasets covering more than 10 task types. We then propose a two-stage supervised fine-tuning strategy to optimize model performance across these varied tasks. Experimental results on 13 test sets covering named entity recognition, relation extraction, text classification, and question answering tasks demonstrate that Taiyi achieves superior performance compared to general LLMs. A case study involving additional biomedical NLP tasks further shows Taiyi's considerable potential for bilingual biomedical multi-tasking. The source code, datasets, and model for Taiyi are freely available at https://github.com/DUTIR-BioNLP/Taiyi-LLM.