Self-supervised Vision Transformer are Scalable Generative Models for Domain Generalization

From arXiv. Accepted at MICCAI 2024. This is the submitted manuscript with an added link to the GitHub repository and funding acknowledgements. No further post-submission improvements or corrections were integrated. The final version has not been published yet.

Despite notable advances, the integration of deep learning (DL) into impactful clinical applications, particularly in digital histopathology, has been hindered by the difficulty of generalizing robustly across diverse imaging domains and characteristics. Traditional mitigation strategies in this field, such as data augmentation and stain color normalization, have proven insufficient, which motivates the exploration of alternative methods. To this end, we propose a novel generative method for domain generalization in histopathology images. Our method employs a generative, self-supervised Vision Transformer to dynamically extract characteristics of image patches and seamlessly infuse them into the original images, thereby creating novel, synthetic images with diverse attributes. By enriching the dataset with such synthesized images, we aim to make it more comprehensive and thereby improve the generalization of DL models to unseen domains. Extensive experiments on two distinct histopathology datasets demonstrate the effectiveness of our approach, which substantially outperforms the state of the art on the Camelyon17-wilds challenge dataset (+2%) and on a second epithelium-stroma dataset (+26%). Furthermore, we emphasize our method's ability to readily scale with increasingly available unlabeled data samples and with more complex architectures that have more parameters. Source code is available at https://github.com/sdoerrich97/vits-are-generative-models.
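
The dataset-enrichment idea described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and is not taken from the authors' repository: `transfer_characteristics` is a deliberately simple stand-in that matches per-channel mean and standard deviation, not the generative Vision Transformer the paper proposes, and `enrich` assumes that each synthetic image keeps the label of the image whose content it preserves.

```python
import random
from statistics import mean, pstdev

# Illustrative representation (an assumption): an image is a flat list of
# (r, g, b) float pixels, and a dataset is a list of (image, label) pairs.


def channel_stats(img):
    """Return a (mean, std) pair for each of the three colour channels."""
    channels = list(zip(*img))
    return [(mean(c), pstdev(c)) for c in channels]


def transfer_characteristics(anatomy, characteristics):
    """Stand-in generator, NOT the paper's ViT.

    Re-colours `anatomy` so that its per-channel mean and standard
    deviation match those of `characteristics`.
    """
    src = channel_stats(anatomy)
    tgt = channel_stats(characteristics)
    out = []
    for px in anatomy:
        out.append(tuple(
            # Normalise with source stats, then rescale with target stats.
            # `s_sd or 1.0` avoids division by zero for flat channels.
            (v - s_mu) / (s_sd or 1.0) * t_sd + t_mu
            for v, (s_mu, s_sd), (t_mu, t_sd) in zip(px, src, tgt)
        ))
    return out


def enrich(dataset, n_synthetic_per_image=1, seed=0):
    """Return the dataset plus synthetic samples.

    Assumption: a synthetic image inherits the label of the image that
    supplies its content (`img`), not of the randomly chosen donor.
    """
    rng = random.Random(seed)
    enriched = list(dataset)
    for img, label in dataset:
        for _ in range(n_synthetic_per_image):
            donor, _ = rng.choice(dataset)
            enriched.append((transfer_characteristics(img, donor), label))
    return enriched
```

In this sketch, replacing `transfer_characteristics` with a learned generator would leave `enrich` unchanged, which is the sense in which such an augmentation scheme could scale with more unlabeled donor images.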
