Pathology image analysis relies heavily on the availability and quality of annotated pathological samples, which are difficult to collect and require substantial human effort. To address this issue, mixing-based approaches offer an effective and practical complement to traditional preprocessing-based data augmentation. However, previous mixing-based data augmentation methods do not fully exploit the essential characteristics of pathology images, including local specificity, global distribution, and the relationships among instances within and across samples. To better capture these characteristics and generate effective pseudo samples, we propose the CellMix framework with a novel distribution-based in-place shuffle strategy. We split images into patches at the granularity of pathology instances and shuffle the patches in place across samples of the same batch. In this way, we generate new samples while keeping the absolute relationship of pathology instances intact. Furthermore, to handle the resulting perturbations and distribution-based noise, we devise a loss-driven strategy inspired by curriculum learning, which lets the model fit the augmented data adaptively during training. To our knowledge, we are the first to explore data augmentation techniques in the pathology image field. Experiments show state-of-the-art (SOTA) results on 7 different datasets. We conclude that this novel instance relationship-based strategy can shed light on general data augmentation for pathology image analysis. The code is available at https://github.com/sagizty/CellMix.
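The in-place shuffle described above can be illustrated with a minimal NumPy sketch. It is not taken from the CellMix repository: the function name `inplace_patch_shuffle`, the `shuffle_ratio` parameter, and the soft-label rule (averaging the one-hot labels of the samples that contributed patches) are assumptions made for illustration. The sketch reads "in-place" as meaning that a patch keeps its grid position and only moves between samples of the same batch.

```python
import numpy as np


def inplace_patch_shuffle(images, labels, patch_size, shuffle_ratio, rng):
    """Hypothetical sketch of an in-place patch shuffle across a batch.

    images: (B, C, H, W) array; labels: (B, K) one-hot array.
    Returns the mixed images and patch-proportional soft labels (an assumption).
    """
    b, c, h, w = images.shape
    assert h % patch_size == 0 and w % patch_size == 0, "patch size must tile the image"
    gh, gw = h // patch_size, w // patch_size
    n_pos = gh * gw

    # Cut each image into a grid of patches: (B, n_pos, C, p, p).
    patches = (
        images.reshape(b, c, gh, patch_size, gw, patch_size)
        .transpose(0, 2, 4, 1, 3, 5)
        .reshape(b, n_pos, c, patch_size, patch_size)
    )

    # source[i, j] is the sample whose patch at grid position j lands in sample i.
    source = np.tile(np.arange(b)[:, None], (1, n_pos))
    n_shuffled = int(round(shuffle_ratio * n_pos))
    chosen = rng.choice(n_pos, size=n_shuffled, replace=False)
    for j in chosen:
        # Permute samples at this position only, so every patch keeps its location.
        source[:, j] = rng.permutation(b)

    mixed = patches[source, np.arange(n_pos)[None, :]]

    # Reassemble the patch grid into full images.
    out = (
        mixed.reshape(b, gh, gw, c, patch_size, patch_size)
        .transpose(0, 3, 1, 4, 2, 5)
        .reshape(b, c, h, w)
    )

    # Assumed labeling rule: average the labels of the contributing samples.
    soft_labels = labels[source].mean(axis=1)
    return out, soft_labels
```

Under this reading, a `shuffle_ratio` of 0 returns the batch unchanged, and larger values mix more grid positions. A loss-driven schedule of the kind mentioned in the abstract could adjust `shuffle_ratio` or `patch_size` during training, but how CellMix actually does so is not specified here.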