Calibrated Triage, Not Autonomy: Confidence Estimation for Medical Vision-Language Models

A vision-language model can answer a question about a medical image fluently and confidently while barely using the image, leaning instead on language priors. In medicine this is the failure that matters most, because the answer looks trustworthy and is not, and the only protection is a confidence score reliable enough to tell the system when to abstain. We ask a deployment question rather than an accuracy one: how much imaging work a model can safely handle alone, and which confidence signal makes that possible. We evaluate seven confidence estimators across five open-weight LVLMs and three medical visual-question-answering datasets spanning broad clinical imaging, radiology, and pathology, with every probe trained only on natural images and applied without adaptation. Recast as bounded selective prediction (automate a case only when confidence clears a threshold, defer the rest), the comparison is cautionary. The standard metrics are poor guides: discrimination barely separates the methods, and the poor calibration of a cheap self-report is easily corrected by off-domain temperature scaling without changing deployable yield. What distinguishes a usable estimator is the high-confidence region a clinician acts on: the weakest baselines are confidently wrong on 41 to 45 percent of their errors against 1 to 4 percent for the best probe, and no estimator is reliably best across domains or models. Safe handoff is governed at two levels: base-model competence sets a ceiling, so a well-calibrated score recovers roughly a third of radiology cases at a 20 percent error tolerance but almost none of pathology; the confidence layer then decides how much of that ceiling is reachable. The usable role today is calibrated triage, not autonomy: automate the cases a calibrated score marks safe, route the rest to a clinician. We release all outputs, correctness judgments, and confidence scores, with code.
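
To make the two quantities in the abstract concrete, the following is a minimal sketch of how deployable yield under an error tolerance and the confidently-wrong share of errors might be computed from per-case confidence scores and correctness judgments. It is an illustration under stated assumptions, not the released code: the function names `coverage_at_risk` and `confidently_wrong_rate`, the 0.9 high-confidence cutoff, and the toy data are all hypothetical, and the paper's exact definitions may differ.

```python
from typing import List, Optional, Tuple


def coverage_at_risk(conf: List[float], correct: List[bool],
                     tolerance: float) -> Tuple[Optional[float], float]:
    """Return (threshold, coverage) for the largest set of automated cases
    whose empirical error rate stays at or below `tolerance`.

    Cases with confidence >= threshold are automated; the rest are deferred.
    Returns (None, 0.0) when no threshold satisfies the tolerance.
    """
    # Rank cases from most to least confident.
    pairs = sorted(zip(conf, correct), key=lambda p: -p[0])
    n = len(pairs)
    best_threshold, best_coverage = None, 0.0
    errors = 0
    for i, (c, ok) in enumerate(pairs):
        errors += 0 if ok else 1
        # Only cut between distinct confidence values so ties stay together.
        if i + 1 < n and pairs[i + 1][0] == c:
            continue
        if errors / (i + 1) <= tolerance:
            best_threshold, best_coverage = c, (i + 1) / n
    return best_threshold, best_coverage


def confidently_wrong_rate(conf: List[float], correct: List[bool],
                           high: float = 0.9) -> float:
    """Share of errors that carry confidence >= `high` (an assumed cutoff)."""
    error_conf = [c for c, ok in zip(conf, correct) if not ok]
    if not error_conf:
        return 0.0
    return sum(c >= high for c in error_conf) / len(error_conf)


# Toy data (hypothetical): ten cases with confidence and correctness.
toy_conf = [0.95, 0.92, 0.90, 0.85, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30]
toy_correct = [True, True, True, True, False, True, False, False, True, False]

threshold, coverage = coverage_at_risk(toy_conf, toy_correct, tolerance=0.20)
print(threshold, coverage)  # 0.7 0.6
print(confidently_wrong_rate(toy_conf, toy_correct, high=0.75))  # 0.25
```

On the toy data, a 20 percent error tolerance lets the six most confident cases be automated (one error in six) and defers the remaining four. This sketch selects the threshold on the same data it evaluates, so the resulting coverage is optimistic; a real deployment would fix the threshold on held-out cases.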
