Recent progress in vision-language modeling for 3D medical imaging has been fueled by large-scale computed tomography (CT) corpora with paired free-text reports, stronger architectures, and powerful pretrained models. This has enabled applications such as automated report generation and text-conditioned 3D image synthesis. Yet current approaches struggle with high-resolution, long-sequence volumes: contrastive pretraining often yields vision encoders that are misaligned with clinical language, and slice-wise tokenization blurs fine anatomy, reducing diagnostic performance on downstream tasks. We introduce BTB3D (Better Tokens for Better 3D), a causal convolutional encoder-decoder that unifies 2D and 3D training and inference while producing compact, frequency-aware volumetric tokens. A three-stage training curriculum enables (i) local reconstruction, (ii) overlapping-window tiling, and (iii) long-context decoder refinement, during which the model learns from short slice excerpts yet generalizes to scans exceeding 300 slices without additional memory overhead. BTB3D sets a new state of the art on two key tasks: it improves BLEU scores and increases clinical F1 by 40% over CT2Rep, CT-CHAT, and Merlin for report generation; and it reduces FID by 75% and halves FVD compared to GenerateCT and MedSyn for text-to-CT synthesis, producing anatomically consistent 512×512×241 volumes. These results confirm that precise three-dimensional tokenization, rather than larger language backbones alone, is essential for scalable vision-language modeling in 3D medical imaging. The codebase is available at: https://github.com/ibrahimethemhamamci/BTB3D
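The overlapping-window tiling in stage (ii) can be illustrated with a minimal sketch: a long volume is covered by fixed-size slice windows whose strides are shorter than the window, so adjacent windows overlap and every slice is seen at least once. The window size and stride below are hypothetical, chosen only to demonstrate the tiling scheme, not BTB3D's actual configuration.

```python
def overlapping_windows(depth, window=32, stride=24):
    """Compute (start, end) slice ranges that tile a volume of `depth`
    slices with overlapping windows. Illustrative sketch only; the
    `window` and `stride` values are hypothetical defaults."""
    if depth <= window:
        return [(0, depth)]
    # Regular strides, then a final window flush with the end so no
    # trailing slices are left uncovered.
    starts = list(range(0, depth - window, stride))
    starts.append(depth - window)
    return [(s, s + window) for s in starts]


# A 300-slice scan (the long-context regime mentioned above) is covered
# by short windows; overlapping slices can later be blended, e.g. by
# averaging the per-window reconstructions.
windows = overlapping_windows(300)
```

Because the encoder only ever sees one short window at a time, peak memory is bounded by the window size rather than the scan length, which is consistent with the abstract's claim of generalizing to 300+ slices without additional memory overhead.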