Large Language Models (LLMs), particularly those similar to ChatGPT, have significantly influenced the field of Natural Language Processing (NLP). While these models excel at general language tasks, their performance on domain-specific downstream tasks such as biomedical and clinical Named Entity Recognition (NER), Relation Extraction (RE), and Medical Natural Language Inference (NLI) is still evolving. In this context, our study investigates the potential of instruction tuning for biomedical language processing, applying this technique to two general LLMs of substantial scale. We present a comprehensive, instruction-based model trained on a dataset of approximately 200,000 instruction-focused samples. This dataset is a curated compilation of existing data, adapted and reformatted to meet the specific requirements of our instruction-based tasks. This initiative represents an important step in utilising such models to achieve results on par with specialised encoder-only models like BioBERT and BioClinicalBERT for various classical biomedical NLP tasks. Our work includes an analysis of the dataset's composition and its impact on model performance, providing insights into the intricacies of instruction tuning. By sharing our code, models, and the instruction-based dataset we assembled, we seek to encourage ongoing research and development in this area.
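To make the reformatting step concrete, the following is a minimal sketch of how one token-level NER example could be turned into an instruction-style sample. It is illustrative only: the instruction wording, the `instruction`/`input`/`output` field names, the `Disease` label, and the helper names `bio_to_spans` and `to_instruction_sample` are all assumptions, not the format used for the released dataset.

```python
from typing import Dict, List

# Hypothetical instruction wording; the actual prompts are not given in this abstract.
NER_INSTRUCTION = (
    "Extract all disease mentions from the sentence below. "
    "Return them as a semicolon-separated list, or 'none' if there are none."
)


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect entity mentions from BIO tags as space-joined token spans."""
    spans: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that was still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            # Continue the currently open entity.
            current.append(token)
        else:
            # "O" tag (or a stray "I-" with no open entity): close the open entity.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


def to_instruction_sample(tokens: List[str], tags: List[str]) -> Dict[str, str]:
    """Reformat one BIO-tagged sentence into an assumed instruction/input/output record."""
    spans = bio_to_spans(tokens, tags)
    return {
        "instruction": NER_INSTRUCTION,
        "input": " ".join(tokens),
        "output": "; ".join(spans) if spans else "none",
    }


if __name__ == "__main__":
    example = to_instruction_sample(
        ["Patients", "with", "cystic", "fibrosis", "develop", "diabetes", "."],
        ["O", "O", "B-Disease", "I-Disease", "O", "B-Disease", "O"],
    )
    print(example["output"])
```

Records of this shape can be serialised one per line as JSON and fed to a standard supervised fine-tuning loop; RE and NLI examples would need their own instruction templates.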