Do Large Language Models have Shared Weaknesses in Medical Question Answering?

Large language models (LLMs) have improved rapidly on medical benchmarks, but their unreliability remains a persistent challenge for safe real-world use. Designing for the use of LLMs as a category, rather than for specific models, requires an understanding of the strengths and weaknesses that are shared across models. To address this challenge, we benchmark a range of top LLMs and identify consistent patterns across them. We test $16$ well-known LLMs on $874$ newly collected questions from Polish medical licensing exams. For each question, we score each model on its top-1 accuracy and on the distribution of probabilities it assigns. We then compare these results with factors such as question difficulty for humans, question length, and the scores of the other models. LLM accuracies were positively correlated pairwise ($0.39$ to $0.58$). Model performance was also correlated with human performance ($0.09$ to $0.13$), but negatively correlated with the difference between the question-level accuracy of top-scoring and bottom-scoring humans ($-0.09$ to $-0.14$). The top output probability and question length were positive and negative predictors of accuracy, respectively ($p < 0.05$). The top-scoring LLM, GPT-4o Turbo, scored $84\%$, with Claude Opus, Gemini 1.5 Pro and Llama 3/3.1 between $74\%$ and $79\%$. We found evidence of similarities between models in which questions they answer correctly, as well as similarities with human test takers. Larger models typically performed better, but differences in training, architecture, and data also had a large impact. Model accuracy was positively correlated with confidence, but negatively correlated with question length. We find similar results with older models, and argue that these patterns are likely to persist across future models that use similar training methods.
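The pairwise comparison of model accuracies described above can be illustrated with a minimal sketch. The abstract does not state which correlation coefficient was used, so this assumes a Pearson correlation over per-question correctness vectors (1 = correct, 0 = incorrect). The function names, model names, and toy data are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A model that gets every question right (or wrong) has no variance.
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(correct):
    """Correlate question-level correctness for every pair of models.

    `correct` maps a model name to a list of 0/1 values, one per question,
    with all lists ordered by the same question index.
    """
    return {
        (a, b): pearson(correct[a], correct[b])
        for a, b in combinations(sorted(correct), 2)
    }


# Hypothetical toy data: three models answering six questions.
toy_results = {
    "model_a": [1, 1, 0, 1, 0, 1],
    "model_b": [1, 0, 0, 1, 0, 1],
    "model_c": [0, 1, 1, 0, 1, 0],
}

if __name__ == "__main__":
    for pair, r in pairwise_correlations(toy_results).items():
        print(pair, round(r, 3))
```

A positive coefficient for a pair indicates that the two models tend to answer the same questions correctly. The same `pearson` helper could be applied to a model's correctness vector and the per-question human pass rate, under the same assumption about the coefficient.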

