Recently, Large Language Models (LLMs) have demonstrated an impressive capability to solve a wide range of tasks. However, despite their success across these tasks, no prior work has investigated their capability in the biomedical domain. To this end, this paper evaluates the performance of LLMs on benchmark biomedical tasks. We conduct a comprehensive evaluation of 4 popular LLMs on 6 diverse biomedical tasks across 26 datasets. To the best of our knowledge, this is the first work to conduct an extensive evaluation and comparison of various LLMs in the biomedical domain. Interestingly, we find that on biomedical datasets with smaller training sets, zero-shot LLMs even outperform the current state-of-the-art fine-tuned biomedical models. This suggests that pretraining on large text corpora makes LLMs quite specialized even in the biomedical domain. We also find that no single LLM outperforms the others on all tasks; the performance of different LLMs varies depending on the task. While their performance is still quite poor in comparison to biomedical models that were fine-tuned on large training sets, our findings demonstrate that LLMs have the potential to be a valuable tool for various biomedical tasks that lack large annotated datasets.