Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints

AI systems are being deployed across medical imaging faster than their failure modes are understood. The failure of greatest clinical concern is hallucination: clinically plausible but factually incorrect outputs, including fabricated anatomical structures, missed findings, incorrect laterality, and invented measurements in generated reports, with direct consequences for decisions such as biopsy, staging, and treatment planning. This structured narrative review synthesizes peer-reviewed studies, benchmark datasets, and FDA regulatory guidance across five imaging modalities to produce a cross-modality analysis of hallucination taxonomy, etiology, detection, and mitigation. We address three questions: (1) how can existing taxonomies be unified across modalities; (2) do medical-specialized foundation models hallucinate less than general-purpose ones; and (3) which mitigation strategies are both effective and compatible with FDA lifecycle oversight? We find that three taxonomic frameworks together cover the imaging pipeline in a way no single framework does alone. We also find that general-purpose foundation models outperform medical-specialized models on hallucination-specific benchmarks, suggesting that narrow domain fine-tuning can introduce overfitting-induced confabulation. At the same time, radiologist oversight remains essential; for instance, a very high percentage of AI-generated flags required expert correction before clinical use. Physics-informed architectural constraints, Chain-of-Thought prompting, and human-in-the-loop safeguards each address different failure modes and are effective when combined. All findings are mapped to the FDA's Total Product Lifecycle and Predetermined Change Control Plan frameworks, which treat hallucination management as a lifecycle obligation rather than a pre-deployment checklist.
