Safe deployment of clinical vision-language models (VLMs) requires reliable uncertainty estimation (UE): a signal indicating when a prediction can be trusted and when it should be escalated to a clinician. We test whether current UE methods actually deliver this signal. Benchmarking 8 methods across 12 VLMs on clinical visual question answering (VQA), we find that UE quality is not an intrinsic property of the UE method: it tracks model accuracy, degrading precisely where model performance is weakest and reliability is therefore most needed. When we stress-test models by hiding the correct option among the multiple-choice answers (none-of-the-above, or NOTA, perturbations), accuracy collapses while uncertainty barely changes, leaving models systematically miscalibrated. Yet uncertainty on the unperturbed input reliably anticipates which predictions will collapse under NOTA, indicating that UE in current VLMs carries diagnostic information about model fragility. Our results position UE as a diagnostic tool for identifying fragile predictions and motivate perturbation-based evaluation as a path toward safe clinical deployment.