Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting. Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%, while reducing high-risk error from 12.0% to 2.6%, contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG did not reproduce this safety profile: agentic RAG improved accuracy over standard RAG and reduced contradiction, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting increased latency without closing the safety gap, and additional inference-time compute produced only limited gains. Worst-case analysis showed that clinically consequential errors concentrated in a small subset of questions. Clinical LLM safety is therefore not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, context construction, and collective failure behavior.
翻译:临床大语言模型通常通过增加模型规模、上下文长度、检索复杂度或推理时计算量进行扩展,并隐含期待更高准确性意味着更安全的行为。这一假设在医学领域并不成立——医学中少数几个自信的、高风险的或与证据矛盾的错误,比平均基准性能更为重要。我们提出SaFE-Scale框架,用于测量临床大语言模型安全性如何随模型规模、证据质量、检索策略、上下文暴露程度和推理时计算量变化。为实例化该框架,我们引入RadSaFE-200——一个包含200道选择题的放射学安全导向评估基准,配有临床专家定义的清晰证据、矛盾证据以及高风险错误、不安全答案和证据矛盾的选项级标签。我们在六种部署条件下评估了34个本地部署的大语言模型:封闭式提示(零样本)、清晰证据、矛盾证据、标准RAG、代理RAG和最大上下文提示。清晰证据带来了最显著的改善,平均准确性从73.5%提升至94.1%,同时高风险错误从12.0%降至2.6%,矛盾率从12.7%降至2.3%,危险过度自信从8.0%降至1.6%。标准RAG和代理RAG未能复现这一安全状况:代理RAG相比标准RAG提高了准确性并降低了矛盾率,但高风险错误和危险过度自信仍居高不下。最大上下文提示增加了延迟但未弥合安全差距,额外的推理时计算仅带来有限改进。最差情况分析显示,临床严重后果错误集中于少数问题。因此,临床大语言模型的安全性并非缩放过程的被动产物,而是受证据质量、检索设计、上下文构建和集体故障行为影响的部署属性。