3DMedAgent: Unified Perception-to-Understanding for 3D Medical Analysis

3D CT analysis spans a continuum from low-level perception to high-level clinical understanding. Existing 3D-oriented analysis methods adopt either isolated task-specific modeling or task-agnostic end-to-end paradigms to produce one-hop outputs, impeding the systematic accumulation of perceptual evidence for downstream reasoning. In parallel, recent multimodal large language models (MLLMs) exhibit improved visual perception and can integrate visual and textual information effectively, yet their predominantly 2D-oriented designs fundamentally limit their ability to perceive and analyze volumetric medical data. To bridge this gap, we propose 3DMedAgent, a unified agent that enables 2D MLLMs to perform general 3D CT analysis without 3D-specific fine-tuning. 3DMedAgent coordinates heterogeneous visual and textual tools through a flexible MLLM agent, progressively decomposing complex 3D analysis into tractable subtasks that transition from global to regional views, from 3D volumes to informative 2D slices, and from visual evidence to structured textual representations. Central to this design, 3DMedAgent maintains a long-term structured memory that aggregates intermediate tool outputs and supports query-adaptive, evidence-driven multi-step reasoning. We further introduce the DeepChestVQA benchmark for evaluating unified perception-to-understanding capabilities in 3D thoracic imaging. Experiments on more than 40 tasks demonstrate that 3DMedAgent consistently outperforms general, medical, and 3D-specific MLLMs, highlighting a scalable path toward general-purpose 3D clinical assistants. Code and data are available at https://github.com/jinlab-imvr/3DMedAgent.
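The decomposition described above (global to regional, 3D volume to 2D slice, visual evidence to text, with a structured memory accumulating tool outputs) can be pictured with a toy loop. The sketch below is purely illustrative and is not the 3DMedAgent implementation: the `Memory` schema, the three tool functions, the 0.5 intensity threshold, and the rule-based planner standing in for the MLLM are all assumptions made for this example, and the "volume" is a nested list rather than real CT data.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple

# Toy stand-in for a CT volume, indexed as [slice][row][col]. Not real CT data.
Volume = List[List[List[float]]]
THRESHOLD = 0.5  # hypothetical intensity cut-off, chosen only for this example


@dataclass
class Memory:
    """Hypothetical structured memory: keyed tool outputs plus the write order."""
    records: Dict[str, Any] = field(default_factory=dict)
    trace: List[str] = field(default_factory=list)

    def write(self, key: str, value: Any) -> None:
        self.records[key] = value
        self.trace.append(key)


def locate_region(volume: Volume, memory: Memory) -> Optional[Tuple[int, int]]:
    """Global -> regional: slice range containing any above-threshold voxel."""
    hits = [i for i, s in enumerate(volume)
            if any(v > THRESHOLD for row in s for v in row)]
    return (hits[0], hits[-1]) if hits else None


def pick_key_slice(volume: Volume, memory: Memory) -> int:
    """3D -> 2D: within the stored region, pick the slice with the largest sum."""
    lo, hi = memory.records["region"]
    return max(range(lo, hi + 1),
               key=lambda i: sum(sum(row) for row in volume[i]))


def describe_slice(volume: Volume, memory: Memory) -> str:
    """Visual -> text: summarize the chosen slice as a short string."""
    idx = memory.records["key_slice"]
    area = sum(1 for row in volume[idx] for v in row if v > THRESHOLD)
    return f"slice {idx}: {area} voxels above threshold"


# Ordered (memory key, tool) pairs. A real agent would let the MLLM choose
# tools per query; here a fixed order plays the role of the planner.
PIPELINE: List[Tuple[str, Callable[[Volume, Memory], Any]]] = [
    ("region", locate_region),
    ("key_slice", pick_key_slice),
    ("finding", describe_slice),
]


def run_agent(volume: Volume, max_steps: int = 10) -> Memory:
    memory = Memory()
    for _ in range(max_steps):
        # Planner stand-in: run the first tool whose output is not yet in memory.
        pending = [(k, t) for k, t in PIPELINE if k not in memory.records]
        if not pending:
            break
        key, tool = pending[0]
        result = tool(volume, memory)
        if result is None:
            # No evidence found at the global stage, so stop early.
            memory.write("finding", "no voxel above threshold")
            break
        memory.write(key, result)
    return memory
```

Each tool reads what earlier tools wrote to memory, so the final textual finding is traceable to the intermediate evidence through `memory.trace`. Whether the actual system structures its memory or tool calls this way is not stated in the abstract.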
