Safety and accuracy follow different scaling laws in clinical large language models

Sebastian Wind, Tri-Thien Nguyen, Jeta Sopa, Mahshad Lotfinia, Sebastian Bickelhaupt, Michael Uder, Harald Köstler, Gerhard Wellein, Sven Nebelung, Daniel Truhn, Andreas Maier, Soroosh Tayebi Arasteh

Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting. Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%, while reducing high-risk error from 12.0% to 2.6%, contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG did not reproduce this safety profile: agentic RAG improved accuracy over standard RAG and reduced contradiction, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting increased latency without closing the safety gap, and additional inference-time compute produced only limited gains. Worst-case analysis showed that clinically consequential errors concentrated in a small subset of questions. Clinical LLM safety is therefore not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, context construction, and collective failure behavior.
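To make the relationship between option-level labels and the reported rates concrete, the following is a minimal sketch of how accuracy and the safety metrics could be computed from multiple-choice responses. It is not the authors' implementation. The `Question` and `Response` types, the field names, and the 0.8 confidence threshold are hypothetical, and the definition of dangerous overconfidence used here (a high-risk error given with confidence at or above the threshold) is an assumption, since the abstract does not define it.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    """Hypothetical container for one benchmark item with option-level labels."""
    correct: str                                   # letter of the correct option
    high_risk: set = field(default_factory=set)    # options labeled high-risk error
    unsafe: set = field(default_factory=set)       # options labeled unsafe answer
    contradicts: set = field(default_factory=set)  # options contradicting the evidence


@dataclass
class Response:
    """Hypothetical container for one model answer."""
    choice: str        # option letter selected by the model
    confidence: float  # self-reported confidence in [0, 1] (assumed available)


def safety_metrics(questions, responses, conf_threshold=0.8):
    """Return each metric as a fraction of all questions.

    Assumption: dangerous overconfidence = high-risk error chosen with
    confidence >= conf_threshold. The source does not state its definition.
    """
    n = len(questions)
    if n == 0 or n != len(responses):
        raise ValueError("questions and responses must be non-empty and aligned")

    counts = {
        "accuracy": 0,
        "high_risk_error": 0,
        "unsafe_answer": 0,
        "contradiction": 0,
        "dangerous_overconfidence": 0,
    }
    for q, r in zip(questions, responses):
        is_high_risk = r.choice in q.high_risk
        counts["accuracy"] += r.choice == q.correct
        counts["high_risk_error"] += is_high_risk
        counts["unsafe_answer"] += r.choice in q.unsafe
        counts["contradiction"] += r.choice in q.contradicts
        counts["dangerous_overconfidence"] += (
            is_high_risk and r.confidence >= conf_threshold
        )
    return {name: count / n for name, count in counts.items()}


# Toy data for illustration only; these are not benchmark items.
questions = [
    Question("A", high_risk={"C"}, unsafe={"C", "D"}, contradicts={"B", "C"}),
    Question("B", high_risk={"A"}, unsafe={"A"}, contradicts={"A"}),
    Question("C", high_risk={"D"}, unsafe={"D"}, contradicts={"A", "D"}),
    Question("D", high_risk={"B"}, unsafe={"B"}, contradicts={"B"}),
]
responses = [
    Response("A", 0.90),  # correct
    Response("A", 0.95),  # high-risk error, confident
    Response("A", 0.60),  # wrong and evidence-contradicting, but not high-risk
    Response("B", 0.50),  # high-risk error, not confident
]
metrics = safety_metrics(questions, responses)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Keeping the labels at the option level means a single wrong answer can count toward several safety metrics at once, or toward none. This is one way accuracy and safety can diverge: two models with the same accuracy may differ in how often their errors land on high-risk or evidence-contradicting options.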
