MamaDino: A Hybrid Vision Model for Breast Cancer 3-Year Risk Prediction

Breast cancer screening programmes increasingly seek to move from one-size-fits-all screening intervals to risk-adapted, personalised strategies. Deep learning (DL) has enabled image-based risk models with stronger 1- to 5-year prediction than traditional clinical models, but leading systems (e.g., Mirai) typically use convolutional backbones, very high-resolution inputs (>1M pixels) and simple multi-view fusion, with limited explicit modelling of contralateral asymmetry. We hypothesised that combining complementary inductive biases (convolutional and transformer-based) with explicit contralateral asymmetry modelling would match state-of-the-art 3-year risk prediction even on substantially lower-resolution mammograms, suggesting that less detailed images, used in a more structured way, can recover state-of-the-art accuracy. We present MamaDino, a mammography-aware multi-view attentional DINO model. MamaDino fuses frozen self-supervised DINOv3 ViT-S features with a trainable CNN encoder at 512x512 resolution, and aggregates bilateral breast information via a BilateralMixer to output a 3-year breast cancer risk score. We train on 53,883 women from OPTIMAM (UK) and evaluate on matched 3-year case-control cohorts: an in-distribution test set from four screening sites and an external out-of-distribution cohort from an unseen site. At the breast level, MamaDino matches Mirai on both the internal and external test sets while using ~13x fewer input pixels. Adding the BilateralMixer improves discrimination to AUC 0.736 (vs 0.713) in-distribution and 0.677 (vs 0.666) out-of-distribution, with consistent performance across age, ethnicity, scanner, tumour type and grade. These findings demonstrate that explicit contralateral modelling and complementary inductive biases allow a model operating on substantially lower-resolution mammograms to match Mirai.
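To make the two headline quantities concrete, the following minimal sketch reproduces the pixel-budget arithmetic and shows how an AUC of the kind reported above can be computed. It rests on assumptions not stated in the abstract: the Mirai input size of 1664x2048 is taken from that model's published preprocessing, the helper names `pixel_ratio` and `auc` are hypothetical, and the risk scores are invented toy values rather than study data.

```python
from itertools import product


def pixel_ratio(high_res, low_res):
    """Return how many times more pixels `high_res` has than `low_res`.

    Both arguments are (height, width) tuples.
    """
    return (high_res[0] * high_res[1]) / (low_res[0] * low_res[1])


def auc(case_scores, control_scores):
    """Rank-based AUC: the probability that a randomly chosen case receives
    a higher risk score than a randomly chosen control (ties count as 0.5).
    """
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case, control in product(case_scores, control_scores):
        if case > control:
            wins += 1.0
        elif case == control:
            wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# ASSUMPTION: Mirai's input size is not given in the abstract; 1664x2048
# is the resolution described in Mirai's own preprocessing.
MIRAI_INPUT = (1664, 2048)
MAMADINO_INPUT = (512, 512)  # stated in the abstract

ratio = pixel_ratio(MIRAI_INPUT, MAMADINO_INPUT)

# HYPOTHETICAL toy risk scores, for illustration only (not study data).
toy_cases = [0.9, 0.6, 0.4]
toy_controls = [0.5, 0.3, 0.4, 0.1]
toy_auc = auc(toy_cases, toy_controls)

print(f"pixel ratio: {ratio:.1f}x")
print(f"toy AUC: {toy_auc:.3f}")
```

Under the assumed Mirai resolution, the ratio comes out at 13, in line with the "~13x fewer input pixels" figure. The pairwise AUC here is quadratic in cohort size and is meant only to illustrate the definition; a sort-based implementation would be preferable for real cohorts.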
