Reliable deployment of Vision-Language Models (VLMs) in radiology requires validation metrics that go beyond surface-level text similarity to ensure clinical fidelity and demographic fairness. This paper investigates a critical blind spot in current model evaluation: decoding strategies that achieve high aggregate token-overlap scores while succumbing to template collapse, in which models generate only repetitive, safe, generic text and omit clinical terminology. Left unaddressed, this blind spot invites metric gaming, where models that perform well on benchmarks prove clinically uninformative. To counter it, we advocate lexical diversity measures that check model generations for clinical specificity. We introduce Clinical Association Displacement (CAD), a vocabulary-level framework that quantifies shifts in demographic-based word associations in generated reports, and Weighted Association Erasure (WAE), which aggregates these shifts to measure clinical signal loss across demographic groups. We show that deterministic decoding produces high levels of semantic erasure, while stochastic sampling yields diverse outputs but risks introducing new biases, motivating a fundamental rethink of how "optimal" reporting is defined.
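Since the abstract describes CAD and WAE only at a high level, the following is a minimal illustrative sketch, not the paper's actual method. It assumes CAD can be read as a per-term shift in group-conditional log-odds associations between reference and generated reports, and WAE as a reference-frequency-weighted aggregate of erased (negatively shifted) associations. All function names (`term_log_odds`, `cad`, `wae`), the smoothing scheme, and the toy data are hypothetical.

```python
# Hypothetical sketch of the CAD/WAE idea from the abstract.
# Assumption: a term's "association" with a demographic group is its
# smoothed log-odds of appearing in that group's reports; CAD is the
# shift in that association from reference to generated reports; WAE
# aggregates only the negative shifts (erased clinical signal).
from collections import Counter
from math import log

def term_log_odds(reports, group_ids, group, vocab, alpha=0.5):
    """Smoothed log-odds of each vocab term for one demographic group."""
    in_group, out_group = Counter(), Counter()
    n_in = n_out = 0
    for report, gid in zip(reports, group_ids):
        tokens = set(report.lower().split())
        target = in_group if gid == group else out_group
        for t in tokens & vocab:
            target[t] += 1
        if gid == group:
            n_in += 1
        else:
            n_out += 1
    odds = {}
    for t in vocab:
        p_in = (in_group[t] + alpha) / (n_in + 2 * alpha)
        p_out = (out_group[t] + alpha) / (n_out + 2 * alpha)
        odds[t] = log(p_in / (1 - p_in)) - log(p_out / (1 - p_out))
    return odds

def cad(refs, gens, group_ids, group, vocab):
    """Clinical Association Displacement: per-term association shift."""
    ref_odds = term_log_odds(refs, group_ids, group, vocab)
    gen_odds = term_log_odds(gens, group_ids, group, vocab)
    return {t: gen_odds[t] - ref_odds[t] for t in vocab}

def wae(refs, gens, group_ids, groups, vocab):
    """Weighted Association Erasure: mean loss of association across
    groups, weighted by each term's frequency in the references."""
    weights = Counter(t for r in refs for t in set(r.lower().split()) & vocab)
    total = sum(weights.values()) or 1
    erasure = 0.0
    for g in groups:
        shifts = cad(refs, gens, group_ids, g, vocab)
        # Only displacements that erase a reference association count.
        erasure += sum(weights[t] / total * max(0.0, -shifts[t]) for t in vocab)
    return erasure / len(groups)

# Toy usage: generations collapse to a generic template for group "B",
# dropping its clinical terms, so WAE rises above zero.
vocab = {"effusion", "cardiomegaly", "pneumothorax"}
refs = ["mild effusion noted", "cardiomegaly present", "no pneumothorax seen"]
gens = ["mild effusion noted", "no acute findings", "no acute findings"]
group_ids = ["A", "B", "B"]
print(f"WAE = {wae(refs, gens, group_ids, ['A', 'B'], vocab):.3f}")
```

Under these assumptions, a model that reproduces reference terminology for every group scores a WAE near zero, while template collapse for any one group drives the score up even when token-overlap metrics remain high.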