Vision-language pretraining has been shown to produce high-quality visual encoders that transfer efficiently to downstream computer vision tasks. Contrastive learning approaches have increasingly been adopted for medical vision-language pretraining (MVLP), yet recent developments in generative AI offer new modeling alternatives. This paper introduces RadTex, a CNN-encoder transformer-decoder architecture optimized for radiology. We explore bidirectional captioning as an alternative MVLP strategy and demonstrate that RadTex's captioning pretraining is competitive with established contrastive methods, achieving a CheXpert macro-AUC of 89.4%. Additionally, RadTex's lightweight text decoder not only generates clinically relevant radiology reports (macro-F1 score of 0.349) but also provides targeted, interactive responses, highlighting the utility of bidirectional captioning in advancing medical image analysis.