Deep learning models in medical image analysis often struggle to generalize across domains and demographic groups due to data heterogeneity and scarcity. Traditional augmentation improves robustness but fails under substantial domain shifts. Recent advances in stylistic augmentation enhance domain generalization by varying image styles, but they either lack style diversity or introduce artifacts into the generated images. To address these limitations, we propose Stylizing ViT, a novel Vision Transformer encoder that uses weight-shared attention blocks for both self- and cross-attention. This design allows the same attention block to maintain anatomical consistency through self-attention while performing style transfer via cross-attention. We assess the effectiveness of our method for domain generalization by employing it for data augmentation on three distinct image classification tasks in histopathology and dermatology. Results demonstrate improved robustness (up to +13% accuracy) over the state of the art while generating perceptually convincing images without artifacts. Additionally, we show that Stylizing ViT is effective beyond training, achieving a 17% performance improvement at inference when used for test-time augmentation. The source code is available at https://github.com/sdoerrich97/stylizing-vit .
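The weight-sharing idea described above can be sketched in a few lines of PyTorch: a single attention module serves as self-attention when queries, keys, and values all come from the content tokens, and as cross-attention when the keys and values are replaced by style tokens. This is a minimal illustrative sketch under assumed names and dimensions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class SharedAttentionBlock(nn.Module):
    """Hypothetical sketch: one attention module reused for both
    self-attention (content only) and cross-attention (content + style)."""

    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        # The same weights are used in both attention modes ("weight-shared").
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, content, style=None):
        if style is None:
            # Self-attention: q = k = v = content tokens,
            # preserving the content (anatomical) structure.
            out, _ = self.attn(content, content, content)
        else:
            # Cross-attention: content queries attend to style keys/values,
            # injecting style information through the same weights.
            out, _ = self.attn(content, style, style)
        return self.norm(content + out)

block = SharedAttentionBlock()
content = torch.randn(1, 16, 64)   # (batch, content tokens, embed dim)
style = torch.randn(1, 16, 64)     # (batch, style tokens, embed dim)
self_out = block(content)          # self-attention pass
cross_out = block(content, style)  # cross-attention pass, shared weights
```

Because both passes go through the same parameters, no separate cross-attention weights need to be trained, which is what lets one encoder serve both roles.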