Representation geometry shapes task performance in vision-language modeling for CT enterography

Computed tomography (CT) enterography is a primary imaging modality for assessing inflammatory bowel disease (IBD), yet the representational choices that best support automated analysis of this modality are unknown. We present the first study of vision-language transfer learning on abdominal CT enterography and identify two main findings. First, mean pooling of slice embeddings gives better categorical disease assessment (59.2\% three-class accuracy), whereas attention pooling gives better cross-modal retrieval (0.235 text-to-image MRR). This pattern holds across all LoRA configurations tested and suggests that the two aggregators emphasize different properties of the learned representation. Second, per-slice tissue contrast matters more than broader spatial coverage: multi-window RGB encoding, which maps complementary Hounsfield Unit windows to RGB channels, outperforms all strategies that increase spatial coverage through multiplanar sampling, and in this setting adding coronal and sagittal views reduces classification performance. For report generation, fine-tuning without retrieval context yields within-1 severity accuracy at the prevalence-matched chance level (70.4\% vs.\ 71\% random), suggesting little learned ordering beyond the class distribution. Retrieval-augmented generation (RAG) improves this across all configurations, scoring 7--14 percentage points above the chance baseline and improving ordinal MAE from 0.98 to 0.80--0.89. A three-teacher pseudolabel framework enables all comparisons without expert annotations. Together, these findings provide the first baselines for this underexplored modality and offer practical guidance for building vision-language systems for volumetric medical imaging.
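To make the two representational choices named above concrete, the following is a minimal NumPy sketch of multi-window RGB encoding and of mean versus attention pooling over slice embeddings. It is an illustration under stated assumptions, not the pipeline used in this study: the window centers and widths in `ASSUMED_WINDOWS`, the function names, and the single-query dot-product form of attention pooling (with `query` standing in for a learned vector) are all hypothetical.

```python
import numpy as np

# ASSUMPTION: illustrative (center, width) pairs in Hounsfield Units.
# The source does not state which windows are mapped to R, G, and B.
ASSUMED_WINDOWS = ((50.0, 400.0), (40.0, 150.0), (0.0, 2000.0))


def apply_window(hu, center, width):
    """Clip HU values to one window and rescale them to [0, 1]."""
    lo = center - width / 2.0
    hi = center + width / 2.0
    return (np.clip(hu, lo, hi) - lo) / (hi - lo)


def multi_window_rgb(hu_slice, windows=ASSUMED_WINDOWS):
    """Stack three windowed copies of one 2D slice as an (H, W, 3) image."""
    channels = [apply_window(hu_slice, c, w) for c, w in windows]
    return np.stack(channels, axis=-1)


def mean_pool(slice_embeddings):
    """Average an (n_slices, dim) array into one (dim,) volume embedding."""
    return slice_embeddings.mean(axis=0)


def attention_pool(slice_embeddings, query):
    """Softmax-weighted average of slice embeddings.

    ASSUMPTION: a single query vector with scaled dot-product scores;
    in a trained model `query` would be a learned parameter.
    """
    dim = slice_embeddings.shape[1]
    scores = slice_embeddings @ query / np.sqrt(dim)
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ slice_embeddings
```

With a zero `query`, the attention weights are uniform and `attention_pool` reduces to `mean_pool`; the two aggregators diverge only once the query scores some slices above others, which is one way to read the classification and retrieval difference reported above.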
