As artificial intelligence (AI) becomes increasingly central to healthcare, explainable and trustworthy models have become paramount. Current report generation systems for chest X-rays (CXR) often lack mechanisms for validating outputs without expert oversight, raising concerns about reliability and interpretability. To address these challenges, we propose a novel multimodal framework designed to enhance the semantic alignment and localization accuracy of AI-generated medical reports. Our framework integrates two key modules: a Phrase Grounding Model, which identifies and localizes pathologies in CXR images based on textual prompts, and a Text-to-Image Diffusion Module, which generates synthetic CXR images from prompts while preserving anatomical fidelity. By comparing features between the original and generated images, we introduce a dual-scoring system: one score quantifies localization accuracy, while the other evaluates semantic consistency. This approach significantly outperforms existing methods, achieving state-of-the-art results in pathology localization and text-to-image alignment. The integration of phrase grounding with diffusion models, coupled with the dual-scoring evaluation system, provides a robust mechanism for validating report quality, paving the way for more trustworthy and transparent AI in medical imaging.
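The dual-scoring idea can be sketched as follows. This is a minimal illustration only, assuming the grounding model outputs pathology bounding boxes and that both the original and diffusion-generated images are embedded into feature vectors; the function names, box format, and scoring choices (IoU for localization, cosine similarity for semantic consistency) are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def localization_score(pred_box, ref_box):
    """IoU between a predicted and a reference box, each (x1, y1, x2, y2).
    Hypothetical stand-in for the framework's localization-accuracy score."""
    ix1 = max(pred_box[0], ref_box[0])
    iy1 = max(pred_box[1], ref_box[1])
    ix2 = min(pred_box[2], ref_box[2])
    iy2 = min(pred_box[3], ref_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred_box) + area(ref_box) - inter
    return inter / union if union > 0 else 0.0

def semantic_consistency(feat_orig, feat_gen):
    """Cosine similarity between feature embeddings of the original CXR
    and the diffusion-generated CXR; a proxy for semantic alignment."""
    a, b = np.asarray(feat_orig, dtype=float), np.asarray(feat_gen, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A report would then be flagged for review when either score falls below a chosen threshold, giving a validation signal that does not require expert oversight for every case.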