XMedFusion: A Knowledge-Guided Multimodal Perception and Reasoning Framework for Autonomous Medical Systems

Autonomous medical and robotic systems increasingly rely on intelligent perception and reasoning to interpret visual data and support clinical decision making. Radiology report generation is a critical component of such automated diagnostic workflows, yet existing end-to-end multimodal models often suffer from weak visual grounding, which leads to unreliable interpretations and the omission of subtle clinical findings. This paper presents XMedFusion, a modular AI framework designed to serve as the perception and reasoning module of an autonomous medical system. The framework decomposes the interpretation of visual information into coordinated functional components that emulate expert-driven analysis: a visual perception agent that extracts image-grounded evidence, a knowledge graph construction agent that structures clinically relevant findings, and a retrieval-guided drafting process that enforces a consistent reporting structure. A synthesis agent then iteratively integrates the visual and structured evidence through reasoning-driven verification to produce reliable and interpretable diagnostic outputs. Experimental evaluation on a public chest radiograph dataset shows marked improvements over baseline vision-language models: BLEU-1 rises from 0.0493 to 0.3359, ROUGE-L from 0.0863 to 0.2440, and METEOR from 0.0829 to 0.1708, while the semantic evaluation scores for Consistency and Accuracy rise from 2.38 to 7.80 and from 2.34 to 6.93, respectively. The results highlight the effectiveness of structured multi-agent perception and reasoning for improving robustness, transparency, and automation in intelligent medical imaging systems, and support integration into autonomous healthcare and robotic diagnostic workflows.
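The data flow described above (perception, knowledge graph construction, retrieval-guided drafting, iterative synthesis) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every function name, the canned findings, the triple format, and the exact-match verification rule are hypothetical stand-ins chosen only to make the control flow concrete. A real system would presumably replace each stub with a vision or language model.

```python
from typing import List, Set, Tuple

# Hypothetical triple format: (entity, relation, value).
Triple = Tuple[str, str, str]


def perceive(image_id: str) -> List[str]:
    """Stand-in for the visual perception agent (assumption: canned output)."""
    canned = {"cxr_001": ["cardiomegaly: present", "pleural effusion: absent"]}
    return canned.get(image_id, [])


def build_graph(findings: List[str]) -> Set[Triple]:
    """Stand-in for the knowledge graph agent: one triple per finding."""
    graph: Set[Triple] = set()
    for finding in findings:
        entity, status = (part.strip() for part in finding.split(":", 1))
        graph.add((entity, "status", status))
    return graph


def retrieve_draft(graph: Set[Triple]) -> List[str]:
    """Stand-in for retrieval-guided drafting.

    Returns sentences from a hypothetical similar report; the second
    sentence is deliberately unsupported by the evidence for cxr_001.
    """
    return ["cardiomegaly is present.", "pneumothorax is present."]


def to_sentence(triple: Triple) -> str:
    entity, _, status = triple
    return f"{entity} is {status}."


def synthesize(graph: Set[Triple], draft: List[str], max_rounds: int = 3) -> List[str]:
    """Stand-in for the synthesis agent.

    Each round drops draft sentences that the graph does not support and
    appends supported findings the draft missed, stopping once a round
    changes nothing.
    """
    supported = {to_sentence(t) for t in graph}
    report = list(draft)
    for _ in range(max_rounds):
        revised = [s for s in report if s in supported]
        revised += sorted(supported - set(revised))
        if revised == report:
            break
        report = revised
    return report


def generate_report(image_id: str) -> List[str]:
    findings = perceive(image_id)
    graph = build_graph(findings)
    draft = retrieve_draft(graph)
    return synthesize(graph, draft)


if __name__ == "__main__":
    for sentence in generate_report("cxr_001"):
        print(sentence)
```

In this toy run the unsupported pneumothorax sentence is removed and the missing pleural effusion finding is added, which mirrors, in a very reduced form, the idea of verifying a retrieved draft against image-grounded evidence.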
