OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis

Tianwei Lin, Zhongwei Qiu, Wenqiao Zhang, Jiang Liu, Yihan Xie, Mingjian Gao, Zhenxuan Fan, Zhaocheng Li, Sijing Li, Zhongle Xie, Peng LU, Yueting Zhuang, Yingda Xia, Ling Zhang, Beng Chin Ooi

Computed Tomography (CT) is one of the most widely used and diagnostically information-dense imaging modalities, covering critical organs such as the heart, lungs, liver, and colon. Clinical interpretation relies on both slice-driven local features (e.g., sub-centimeter nodules, lesion boundaries) and volume-driven spatial representations (e.g., tumor infiltration, inter-organ anatomical relations). However, existing Large Vision-Language Models (LVLMs) remain split between CT slice understanding and volumetric understanding: slice-driven LVLMs generalize well but lack cross-slice spatial consistency, while volume-driven LVLMs explicitly capture volumetric semantics but suffer from coarse granularity and poor compatibility with slice inputs. The absence of a unified modeling paradigm is a major bottleneck for the clinical translation of medical LVLMs. We present OmniCT, a powerful unified slice-volume LVLM for CT scenarios, which makes three contributions: (i) Spatial Consistency Enhancement (SCE): volumetric slice composition combined with tri-axial positional embedding introduces volumetric consistency, and a Mixture-of-Experts (MoE) hybrid projection enables efficient slice-volume adaptation; (ii) Organ-level Semantic Enhancement (OSE): segmentation and region-of-interest (ROI) localization explicitly align anatomical regions, emphasizing lesion- and organ-level semantics; (iii) MedEval-CT: the largest slice-volume CT dataset, paired with a hybrid benchmark that integrates comprehensive metrics for unified evaluation. OmniCT consistently outperforms existing methods by a substantial margin across diverse clinical tasks, combining micro-level detail sensitivity with macro-level spatial reasoning. More importantly, it establishes a new paradigm for cross-modal medical imaging understanding.
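
The abstract names tri-axial positional embedding but does not give its formulation. The sketch below is only an illustration of the general idea, under stated assumptions: each visual token carries a (z, y, x) index, each axis receives a standard sinusoidal encoding, and the three encodings are concatenated. The function names `sinusoidal` and `tri_axial_embedding`, the concatenation scheme, and the base of 10000 are all hypothetical choices, not details confirmed for OmniCT.

```python
import math


def sinusoidal(position, dim):
    """Standard sinusoidal encoding of one integer position (dim must be even)."""
    if dim <= 0 or dim % 2 != 0:
        raise ValueError("dim must be a positive even number")
    values = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000.0 ** (i / dim))
        values.append(math.sin(position * freq))
        values.append(math.cos(position * freq))
    return values


def tri_axial_embedding(z, y, x, dim):
    """Hypothetical tri-axial embedding for a token at slice z, row y, column x.

    Assumption: one third of the channels encodes each axis, and the three
    per-axis encodings are concatenated in (z, y, x) order.
    """
    if dim % 6 != 0:
        raise ValueError("dim must be divisible by 6 (three even-sized parts)")
    per_axis = dim // 3
    return sinusoidal(z, per_axis) + sinusoidal(y, per_axis) + sinusoidal(x, per_axis)


# Two tokens on the same slice (z = 1) share the z-part of the embedding,
# which is one way slice membership could be made explicit to the model.
same_slice_a = tri_axial_embedding(1, 0, 0, 12)
same_slice_b = tri_axial_embedding(1, 2, 1, 12)
print(same_slice_a[:4] == same_slice_b[:4])
```

Under this assumed scheme, a single 2D slice is just the special case where every token shares one z index, so slice and volume inputs can use the same encoding; whether OmniCT handles the two cases this way is not stated in the abstract.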
