Automatic methods for early detection of breast cancer on mammography can significantly decrease mortality. Broad uptake of those methods in hospitals is currently hindered because the methods impose too many constraints. They assume that annotations are available for single images or even regions of interest (ROIs), and that each patient has a fixed number of images. Neither assumption holds in a general hospital setting. Relaxing those assumptions results in a weakly supervised learning setting, where labels are available per case, but not for individual images or ROIs. Not all images taken for a patient contain malignant regions, and malignant ROIs cover only a tiny part of an image, whereas most image regions represent benign tissue. In this work, we investigate a two-level multi-instance learning (MIL) approach for case-level breast cancer prediction on two public datasets (1.6k and 5k cases) and an in-house dataset of 21k cases. Observing that breast cancer is usually present on only one side, while images of both breasts are taken as a precaution, we propose a domain-specific MIL pooling variant. We show that two-level MIL can be applied in realistic clinical settings where only case labels and a variable number of images per patient are available. Data in realistic settings scales with continuous patient intake, while manual annotation efforts do not. Hence, research should focus in particular on unsupervised ROI extraction in order to improve breast cancer prediction for all patients.
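
To make the two-level structure concrete, the following is a minimal sketch of how ROI scores might be pooled into image scores and image scores into a case score, with the second level grouping images by breast side. The abstract does not specify the pooling operators, so everything here is an assumption made for illustration: max pooling over ROIs, mean pooling within a side, max across sides, and the names `image_score`, `case_score`, and `Case` are all hypothetical. The per-ROI scores are assumed to come from an upstream model that is not shown.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical input: one case is a list of (side, roi_scores) pairs, where
# side is "L" or "R" and roi_scores are per-ROI malignancy probabilities
# in [0, 1] produced by some upstream model (not shown here).
Case = List[Tuple[str, List[float]]]


def image_score(roi_scores: List[float]) -> float:
    """Level 1: pool ROI scores into one image score (max pooling, an assumption)."""
    if not roi_scores:
        raise ValueError("an image needs at least one ROI score")
    return max(roi_scores)


def case_score(case: Case) -> float:
    """Level 2: pool image scores per breast side, then keep the more suspicious side."""
    per_side: Dict[str, List[float]] = defaultdict(list)
    for side, roi_scores in case:
        per_side[side].append(image_score(roi_scores))
    if not per_side:
        raise ValueError("a case needs at least one image")
    # Mean within a side (several views of the same breast), max across sides,
    # so that a healthy breast cannot dilute the score of the affected one.
    side_scores = [sum(scores) / len(scores) for scores in per_side.values()]
    return max(side_scores)


# A case with a variable number of images per side (2 left, 3 right).
example: Case = [
    ("L", [0.05, 0.10, 0.02]),
    ("L", [0.08, 0.04]),
    ("R", [0.03, 0.90, 0.07]),
    ("R", [0.60, 0.20]),
    ("R", [0.10]),
]
print(round(case_score(example), 3))
```

Because both levels reduce over lists of arbitrary length, the sketch accepts any number of ROIs per image and any number of images per case, which is the property the weakly supervised setting requires. In a trained model the fixed pooling operators would typically be replaced by learned ones.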