Large language models (LLMs) have improved rapidly on medical benchmarks, but their unreliability remains a persistent challenge for safe real-world use. Designing for the use of LLMs as a category, rather than for specific models, requires an understanding of the strengths and weaknesses that are shared across models. To address this challenge, we benchmark a range of top LLMs and identify consistent patterns across them. We test $16$ well-known LLMs on $874$ newly collected questions from Polish medical licensing exams. For each question, we score each model on its top-1 accuracy and on the distribution of probabilities it assigns to the answer options. We then compare these results with factors such as question difficulty for humans, question length, and the scores of the other models. LLM accuracies were positively correlated pairwise ($0.39$ to $0.58$). Model performance was also positively correlated with human performance ($0.09$ to $0.13$), but negatively correlated with the difference between the question-level accuracy of top-scoring and bottom-scoring humans ($-0.09$ to $-0.14$). The top output probability was a positive predictor of accuracy and question length a negative one ($p < 0.05$). The top-scoring LLM, GPT-4o Turbo, scored $84\%$, with Claude Opus, Gemini 1.5 Pro and Llama 3/3.1 between $74\%$ and $79\%$. We found evidence of similarities between models in which questions they answer correctly, as well as similarities with human test takers. Larger models typically performed better, but differences in training, architecture, and data also had a large impact. Model accuracy was positively correlated with confidence, but negatively correlated with question length. We find similar results with older models, and argue that these patterns are likely to persist in future models that use similar training methods.
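As an illustration of the pairwise comparison described in the abstract, the following is a minimal sketch of how question-level correctness could be correlated between models. The abstract does not state which correlation measure was used; Pearson correlation on binary correctness vectors (equivalent to the phi coefficient) is an assumption here, and the model names and scores are hypothetical toy data, not results from the study.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(correct):
    """Map each pair of models to the correlation of their per-question scores."""
    return {
        (m1, m2): pearson(correct[m1], correct[m2])
        for m1, m2 in combinations(sorted(correct), 2)
    }


# Hypothetical toy data: 1 = question answered correctly, 0 = incorrectly.
toy_correct = {
    "model_a": [1, 1, 0, 1, 0, 1, 1, 0],
    "model_b": [1, 1, 0, 0, 0, 1, 1, 1],
    "model_c": [1, 0, 0, 1, 0, 1, 1, 0],
}
toy_correlations = pairwise_correlations(toy_correct)
```

Under the same assumption, the comparison with human difficulty or question length could reuse `pearson` with a per-question human accuracy or length vector in place of a second model's correctness vector.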