Despite recent progress, medical foundation models still struggle to unify visual understanding and generation, as these tasks have inherently conflicting goals: semantic abstraction versus pixel-level reconstruction. Existing approaches, typically based on parameter-shared autoregressive architectures, frequently compromise performance on one or both tasks. To address this, we present UniX, a next-generation unified medical foundation model for chest X-ray understanding and generation. UniX decouples the two tasks into an autoregressive branch for understanding and a diffusion branch for high-fidelity generation. Crucially, a cross-modal self-attention mechanism is introduced to dynamically guide the generation process with understanding features. Coupled with a rigorous data cleaning pipeline and a multi-stage training strategy, this architecture enables synergistic collaboration between the tasks while leveraging the strengths of diffusion models for superior generation. On two representative benchmarks, UniX achieves a 46.1% improvement in understanding performance (Micro-F1) and a 24.2% gain in generation quality (FD-RadDino), using only a quarter of the parameters of LLM-CXR. By achieving performance on par with task-specific models, our work establishes a scalable paradigm for synergistic medical image understanding and generation. Code and models are available at https://github.com/ZrH42/UniX.
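To make the cross-modal guidance idea concrete, the sketch below shows one plausible reading of it: features from the generation branch act as queries that attend over features from the understanding branch via scaled dot-product attention. This is a minimal, dependency-free illustration with hypothetical names (`cross_modal_attention`, `gen_feats`, `und_feats`), not UniX's actual implementation, which the abstract does not detail.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_modal_attention(gen_feats, und_feats):
    """Toy sketch (hypothetical, not the paper's code): each generation-branch
    feature vector is a query attending over understanding-branch feature
    vectors, which serve as both keys and values."""
    out = []
    for q in gen_feats:
        d = len(q)
        # Scaled dot-product scores between this query and all keys.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in und_feats]
        w = softmax(scores)
        # Attention-weighted sum of the value vectors, dimension by dimension.
        out.append([sum(wi * vi for wi, vi in zip(w, col))
                    for col in zip(*und_feats)])
    return out
```

In a real model the queries, keys, and values would pass through learned projections and multiple heads; the point here is only the direction of information flow, from the understanding branch into the generation branch.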