Although large language models (LLMs) often produce impressive outputs, it remains unclear how they perform in real-world scenarios that require strong reasoning skills and expert domain knowledge. We set out to investigate whether closed- and open-source models (GPT-3.5, Llama-2, etc.) can be applied to answer and reason about difficult real-world questions. We focus on three popular medical benchmarks (MedQA-USMLE, MedMCQA, and PubMedQA) and multiple prompting scenarios: Chain-of-Thought (CoT, i.e., think step-by-step), few-shot prompting, and retrieval augmentation. Based on expert annotation of the generated CoTs, we found that InstructGPT can often read, reason, and recall expert knowledge. Finally, by leveraging advances in prompt engineering (few-shot and ensemble methods), we demonstrated that GPT-3.5 not only yields calibrated predictive distributions but also reaches the passing score on all three datasets: MedQA-USMLE (60.2%), MedMCQA (62.7%), and PubMedQA (78.2%). Open-source models are closing the gap: Llama-2 70B also passed the MedQA-USMLE with 62.5% accuracy.