Recent progress in vision-language modeling for 3D medical imaging has been fueled by large-scale computed tomography (CT) corpora with paired free-text reports, stronger architectures, and powerful pretrained models. This has enabled applications such as automated report generation and text-conditioned 3D image synthesis. Yet current approaches struggle with high-resolution, long-sequence volumes: contrastive pretraining often yields vision encoders that are misaligned with clinical language, and slice-wise tokenization blurs fine anatomy, reducing diagnostic performance on downstream tasks. We introduce BTB3D (Better Tokens for Better 3D), a causal convolutional encoder-decoder that unifies 2D and 3D training and inference while producing compact, frequency-aware volumetric tokens. A three-stage training curriculum enables (i) local reconstruction, (ii) overlapping-window tiling, and (iii) long-context decoder refinement, during which the model learns from short slice excerpts yet generalizes to scans exceeding 300 slices without additional memory overhead. BTB3D sets a new state of the art on two key tasks: it improves BLEU scores and increases clinical F1 by 40% over CT2Rep, CT-CHAT, and Merlin for report generation; and it reduces FID by 75% and halves FVD compared to GenerateCT and MedSyn for text-to-CT synthesis, producing anatomically consistent 512×512×241 volumes. These results confirm that precise three-dimensional tokenization, rather than larger language backbones alone, is essential for scalable vision-language modeling in 3D medical imaging. The codebase is available at: https://github.com/ibrahimethemhamamci/BTB3D
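The overlapping-window tiling in stage (ii) can be illustrated with a minimal sketch: a long volume is covered by fixed-size slice windows whose strides are shorter than the window, so adjacent windows overlap and every slice is seen at least once. The window size and stride below are hypothetical, chosen only to demonstrate the tiling scheme, not BTB3D's actual configuration.

```python
def overlapping_windows(depth, window=32, stride=24):
    """Compute (start, end) slice ranges that tile a volume of `depth`
    slices with overlapping windows. Illustrative sketch only; the
    `window` and `stride` values are hypothetical defaults."""
    if depth <= window:
        return [(0, depth)]
    # Regular strides, then a final window flush with the end so no
    # trailing slices are left uncovered.
    starts = list(range(0, depth - window, stride))
    starts.append(depth - window)
    return [(s, s + window) for s in starts]


# A 300-slice scan (the long-context regime mentioned above) is covered
# by short windows; overlapping slices can later be blended, e.g. by
# averaging the per-window reconstructions.
windows = overlapping_windows(300)
```

Because the encoder only ever sees one short window at a time, peak memory is bounded by the window size rather than the scan length, which is consistent with the abstract's claim of generalizing to 300+ slices without additional memory overhead.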