Exploring the Effectiveness of Instruction Tuning in Biomedical Language Processing

Large Language Models (LLMs), particularly those similar to ChatGPT, have significantly influenced the field of Natural Language Processing (NLP). While these models excel in general language tasks, their performance in domain-specific downstream tasks such as biomedical and clinical Named Entity Recognition (NER), Relation Extraction (RE), and Medical Natural Language Inference (NLI) is still evolving. In this context, our study investigates the potential of instruction tuning for biomedical language processing, applying this technique to two general LLMs of substantial scale. We present a comprehensive, instruction-based model trained on a dataset that consists of approximately $200,000$ instruction-focused samples. This dataset represents a carefully curated compilation of existing data, meticulously adapted and reformatted to align with the specific requirements of our instruction-based tasks. This initiative represents an important step in utilising such models to achieve results on par with specialised encoder-only models like BioBERT and BioClinicalBERT for various classical biomedical NLP tasks. Our work includes an analysis of the dataset's composition and its impact on model performance, providing insights into the intricacies of instruction tuning. By sharing our codes, models, and the distinctively assembled instruction-based dataset, we seek to encourage ongoing research and development in this area.

翻译：大型语言模型（LLMs），尤其是类似ChatGPT的模型，已显著影响自然语言处理（NLP）领域。尽管这些模型在通用语言任务中表现出色，但在生物医学和临床命名实体识别（NER）、关系抽取（RE）以及医学自然语言推理（NLI）等特定领域下游任务中的性能仍在不断演变。在此背景下，我们的研究探讨了指令微调在生物医学语言处理中的潜力，并将该技术应用于两个规模可观的大型语言模型。我们提出了一个全面的、基于指令的模型，该模型在包含约20万条指令聚焦样本的数据集上进行了训练。该数据集是对现有数据进行精心策划的汇编，经过细致调整和重新格式化，以满足我们基于指令的任务的特定要求。这项工作是利用此类模型在各类经典生物医学NLP任务中取得与专用编码器模型（如BioBERT和BioClinicalBERT）相媲美结果的重要一步。我们的工作包括分析数据集的构成及其对模型性能的影响，从而揭示指令微调的复杂性。通过分享我们的代码、模型以及独特构建的基于指令的数据集，我们旨在鼓励该领域的持续研究和发展。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】视觉提示调整（VPT），Vision Prompt Tuning

专知会员服务

32+阅读 · 2022年3月12日

【CVPR 2022】基于元内存传输的跨域少镜头语义分割，Remember the Difference: Cross-Domain Few-Shot Semantic Segmentation via Meta-Memory Transfer

专知会员服务

13+阅读 · 2022年3月12日

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日