Large Language Models (LLMs) have recently demonstrated an impressive capability to solve a wide range of tasks. However, despite their success across various tasks, no prior work has investigated their capability in the biomedical domain. To this end, this paper evaluates the performance of LLMs on benchmark biomedical tasks. We conduct a comprehensive evaluation of 4 popular LLMs on 6 diverse biomedical tasks across 26 datasets. To the best of our knowledge, this is the first work that conducts an extensive evaluation and comparison of various LLMs in the biomedical domain. Interestingly, we find that on biomedical datasets with smaller training sets, zero-shot LLMs even outperform the current state-of-the-art fine-tuned biomedical models. This suggests that pretraining on large text corpora makes LLMs quite specialized even in the biomedical domain. We also find that no single LLM outperforms the others on all tasks; the performance of different LLMs may vary depending on the task. While their performance is still quite poor in comparison to biomedical models fine-tuned on large training sets, our findings demonstrate that LLMs have the potential to be a valuable tool for biomedical tasks that lack large annotated datasets.