We evaluate four state-of-the-art instruction-tuned large language models (LLMs), namely ChatGPT, Flan-T5 UL2, Tk-Instruct, and Alpaca, on a set of 13 real-world clinical and biomedical natural language processing (NLP) tasks in English, including named-entity recognition (NER), question answering (QA), and relation extraction (RE). Overall, our results show that the evaluated LLMs begin to approach the performance of state-of-the-art models in zero- and few-shot scenarios for most tasks, and perform particularly well on QA, even though they have never seen examples from these tasks before. However, on the classification and RE tasks, their performance falls below what can be achieved with a model trained specifically for the medical domain, such as PubMedBERT. Finally, no single LLM outperforms all the others on every task studied; some models are better suited to certain tasks than others.