To determine whether using discrete semantic entropy (DSE) to reject questions likely to generate hallucinations can improve the accuracy of black-box vision-language models (VLMs) in radiologic image-based visual question answering (VQA).

This retrospective study evaluated DSE using two publicly available, de-identified datasets: (i) the VQA-Med 2019 benchmark (500 images with clinical questions and short-text answers) and (ii) a diagnostic radiology dataset (206 cases: 60 computed tomography scans, 60 magnetic resonance images, 60 radiographs, and 26 angiograms) with corresponding ground-truth diagnoses. GPT-4o and GPT-4.1 answered each question 15 times at a temperature of 1.0; baseline accuracy was determined from low-temperature answers (temperature 0.1). Meaning-equivalent responses were grouped using bidirectional entailment checks, and DSE was computed from the relative frequencies of the resulting semantic clusters. Accuracy was recalculated after excluding questions with DSE > 0.6 or DSE > 0.3. p-values and 95% confidence intervals were obtained by bootstrap resampling, with a Bonferroni-corrected threshold of p < .004 for statistical significance.

Across 706 image-question pairs, baseline accuracy was 51.7% for GPT-4o and 54.8% for GPT-4.1. After filtering out high-entropy questions (DSE > 0.3), accuracy on the retained questions was 76.3% for GPT-4o (334/706 questions retained) and 63.8% for GPT-4.1 (499/706 retained; both p < .001). Accuracy gains were observed across both datasets and largely remained statistically significant after Bonferroni correction.

DSE enables reliable hallucination detection in black-box VLMs by quantifying semantic inconsistency. The method significantly improves diagnostic answer accuracy and offers a practical filtering strategy for clinical VLM applications.
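The filtering pipeline described above (sample repeated answers, cluster meaning-equivalent responses via bidirectional entailment, compute entropy over cluster frequencies, reject high-entropy questions) can be sketched as follows. This is a minimal illustration, not the study's implementation: the `entails` predicate stands in for the bidirectional entailment check (in practice an LLM or NLI model), and the natural-log base and greedy clustering order are assumptions not specified in the abstract.

```python
import math

def cluster_by_entailment(answers, entails):
    """Greedily group answers that bidirectionally entail each other.

    `entails(a, b)` should return True if answer a entails answer b;
    here it is a placeholder for an LLM/NLI-based entailment check.
    """
    clusters = []
    for ans in answers:
        for cluster in clusters:
            rep = cluster[0]  # compare against the cluster's first member
            if entails(ans, rep) and entails(rep, ans):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    return clusters

def discrete_semantic_entropy(clusters):
    """Shannon entropy (natural log assumed) over semantic-cluster frequencies."""
    n = sum(len(c) for c in clusters)
    return -sum(len(c) / n * math.log(len(c) / n) for c in clusters)

def keep_question(answers, entails, threshold=0.3):
    """Retain a question only if its DSE is at or below the threshold."""
    clusters = cluster_by_entailment(answers, entails)
    return discrete_semantic_entropy(clusters) <= threshold
```

With 15 samples that all land in one semantic cluster, the entropy is 0 and the question is retained; if the samples split 8/7 between two meanings, the entropy is roughly 0.69, so the question would be rejected at either the 0.3 or the 0.6 threshold.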
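The abstract specifies only that confidence intervals were obtained by bootstrap resampling. A percentile bootstrap over per-question correctness indicators might look like the sketch below; the 10,000-resample count and the percentile method are assumptions, not details from the study.

```python
import random

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy.

    `outcomes` is a list of 0/1 correctness indicators, one per question.
    Resamples with replacement, recomputes accuracy each time, and
    returns the (alpha/2, 1 - alpha/2) percentiles of the resampled stats.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    stats = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[min(int((1 - alpha / 2) * n_boot), n_boot - 1)]
    return lo, hi
```

A two-sided bootstrap p-value for an accuracy difference would follow the same pattern, resampling paired before/after outcomes and counting how often the resampled difference crosses zero.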