BioInstruct: Instruction Tuning of Large Language Models for Biomedical Natural Language Processing

We aim to enhance the performance of large language models (LLMs) in biomedical natural language processing (BioNLP) by introducing a domain-specific instruction dataset and examining its impact when combined with multi-task learning principles. We created BioInstruct, a dataset comprising 25,005 instructions, to instruction-tune LLMs (LLaMA 1 and 2, 7B and 13B versions). The instructions were created by prompting the GPT-4 language model with three seed samples randomly drawn from 80 human-curated instructions. We employed Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. We then evaluated these instruction-tuned LLMs on several BioNLP tasks, which can be grouped into three major categories: question answering (QA), information extraction (IE), and text generation (GEN). We also examined whether the category of instructions (e.g., QA, IE, or generation) affects model performance. Compared with LLMs without instruction tuning, our instruction-tuned LLMs demonstrated marked performance gains: 17.3% in QA, 5.7% in IE, and 96% in generation tasks. Our 7B-parameter instruction-tuned LLaMA 1 model was competitive with, or even surpassed, other biomedical LLMs that were also fine-tuned from LLaMA 1 with vast domain-specific data or a variety of tasks. Our results also show that the performance gain is significantly higher when instruction fine-tuning is conducted with closely related tasks. These findings align with observations from multi-task learning, suggesting synergies between tasks. The BioInstruct dataset serves as a valuable resource, and instruction-tuned LLMs lead to the best-performing BioNLP applications.
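The seed-sampling step described above can be illustrated with a minimal sketch. It assumes each curated instruction is stored as a record with `instruction`, `input`, and `output` fields; that schema, the helper name `build_generation_prompt`, and the prompt wording are all hypothetical and are not taken from the paper. The sketch only builds the prompt text and does not call GPT-4.

```python
import random


def build_generation_prompt(seed_pool, k=3, rng=None):
    """Draw k seed instructions at random and format them as few-shot examples."""
    rng = rng or random.Random()
    seeds = rng.sample(seed_pool, k)  # sampling without replacement
    # Prompt wording below is a placeholder, not the authors' actual prompt.
    lines = ["Here are example biomedical instructions:"]
    for i, seed in enumerate(seeds, start=1):
        lines.append(f"{i}. Instruction: {seed['instruction']}")
        lines.append(f"   Input: {seed['input']}")
        lines.append(f"   Output: {seed['output']}")
    lines.append("Write new, diverse instructions in the same format.")
    return seeds, "\n".join(lines)


# Placeholder pool standing in for the 80 human-curated instructions.
seed_pool = [
    {"instruction": f"placeholder instruction {i}", "input": "", "output": ""}
    for i in range(80)
]

seeds, prompt = build_generation_prompt(seed_pool, k=3, rng=random.Random(0))
print(prompt)
```

Repeating this draw with fresh random seeds gives the generating model a different trio of examples each time, which is one plausible way to encourage variety across the generated instructions.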
