Masked image modelling (MIM) is a powerful self-supervised representation learning paradigm whose potential has not been widely demonstrated in medical image analysis. In this work, we show the capacity of MIM to capture rich semantic representations of Haematoxylin & Eosin (H&E)-stained images at the nuclear level. Inspired by Bidirectional Encoder representation from Image Transformers (BEiT), we split the images into smaller patches and generate corresponding discrete visual tokens. In addition to the regular grid-based patches typically used in vision Transformers, we introduce patches of individual cell nuclei, and we propose a positional encoding for the irregular distribution of these structures within an image. We pre-train the model in a self-supervised manner on H&E-stained whole-slide images of diffuse large B-cell lymphoma in which cell nuclei have been segmented. The pre-training objective is twofold: to recover the original discrete visual tokens of the masked image, and to reconstruct the visual tokens of the masked object instances. Coupling these two pre-training tasks allows us to build powerful, context-aware representations of nuclei. Our model generalizes well and can be fine-tuned on downstream classification tasks, improving cell classification accuracy on the PanNuke dataset by more than 5% compared to current instance segmentation methods.
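The abstract does not specify how the positional encoding of irregularly placed nuclei is computed. As a purely illustrative sketch, one plausible option is a sinusoidal encoding of each nucleus centroid's continuous (x, y) coordinates. The function name `centroid_positional_encoding`, the parameters `dim` and `max_period`, and the choice of a sinusoidal scheme are all assumptions made here for illustration, not the authors' method.

```python
import math


def centroid_positional_encoding(centroids, dim=64, max_period=10000.0):
    """Sinusoidal encoding of continuous (x, y) nucleus centroids.

    Hypothetical helper: the encoding scheme is an assumption, not taken
    from the paper. `centroids` is an iterable of (x, y) pixel coordinates.
    Returns one list of `dim` floats per centroid; the first half of the
    channels encodes x, the second half encodes y.
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    n_freq = dim // 4
    # Geometrically spaced frequencies, as in standard Transformer encodings.
    freqs = [1.0 / max_period ** (k / n_freq) for k in range(n_freq)]
    encodings = []
    for x, y in centroids:
        row = []
        for coord in (x, y):
            # Sine channels followed by cosine channels for this coordinate.
            row.extend(math.sin(coord * f) for f in freqs)
            row.extend(math.cos(coord * f) for f in freqs)
        encodings.append(row)
    return encodings


# Two nuclei at arbitrary, non-grid positions (made-up coordinates).
enc = centroid_positional_encoding([(0.0, 0.0), (12.5, 40.0)], dim=8)
```

Because the encoding is a function of real-valued coordinates rather than of a grid index, it can be evaluated for nuclei located anywhere in the image; a learned embedding of the coordinates would be another reasonable choice.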