Large language models (LLMs) have recently shown impressive ability in biomedical question answering, but they have not been adequately investigated for more specific biomedical applications. This study investigates the performance of LLMs such as the ChatGPT family of models (GPT-3.5s, GPT-4) in biomedical tasks beyond question answering. Because no patient data can be passed to the public OpenAI API, we evaluated model performance on over 10,000 samples that serve as proxies for two fundamental tasks in the clinical domain: classification and reasoning. The first task is classifying whether statements of clinical and policy recommendations in the scientific literature constitute health advice. The second task is detecting causal relations in the biomedical literature. We compared LLMs with simpler models, such as bag-of-words (BoW) with logistic regression, and with fine-tuned BioBERT models. Despite the excitement around ChatGPT's viral popularity, we found that fine-tuning remained the best strategy for both of these fundamental NLP tasks. The simple BoW model performed on par with the most complex LLM prompting, and prompt engineering required significant investment.
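To make the simplest baseline named in the abstract concrete, the following is a minimal sketch of a bag-of-words classifier with logistic regression, written in plain Python so that it runs without dependencies. It is not the study's implementation: the function names (`tokenize`, `build_vocab`, `vectorize`, `train_logreg`, `predict_proba`), the hyperparameters, and the six toy sentences are all assumptions made for illustration, and a real experiment would more likely use an established library and the actual annotated corpus.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (tok.strip(".,;:!?()").lower() for tok in text.split())
    return [t for t in tokens if t]


def build_vocab(texts):
    """Map each distinct token to a column index."""
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab


def vectorize(text, vocab):
    """Bag-of-words: one count per vocabulary token, word order ignored."""
    vec = [0.0] * len(vocab)
    for tok, n in Counter(tokenize(text)).items():
        if tok in vocab:  # tokens unseen in training are dropped
            vec[vocab[tok]] = float(n)
    return vec


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, lr=0.1, epochs=300):
    """Fit weights and bias by stochastic gradient descent on log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of log loss w.r.t. the linear score
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def predict_proba(text, vocab, w, b):
    """Probability that `text` belongs to the positive class."""
    x = vectorize(text, vocab)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical toy data (NOT from the study): 1 = health advice, 0 = not advice.
TRAIN = [
    ("Patients should receive the vaccine annually", 1),
    ("We recommend that clinicians screen adults for hypertension", 1),
    ("Physicians should consider statin therapy", 1),
    ("The cohort included 200 participants", 0),
    ("Blood pressure was measured at baseline", 0),
    ("Data were collected between 2010 and 2015", 0),
]

vocab = build_vocab([text for text, _ in TRAIN])
X = [vectorize(text, vocab) for text, _ in TRAIN]
y = [label for _, label in TRAIN]
w, b = train_logreg(X, y)

if __name__ == "__main__":
    for text, label in TRAIN:
        print(label, round(predict_proba(text, vocab, w, b), 3), text)
```

The point of the sketch is how little machinery the baseline needs: a vocabulary, a count vector per sentence, and one weight per token. That simplicity is what makes its reported parity with elaborate LLM prompting notable.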