Gene expression profiling can subtype breast cancer and predicts risk of recurrence and treatment responsiveness better than routine immunohistochemistry (IHC). In the clinic, however, molecular profiling is used primarily for ER+ cancer; it is costly, destroys tissue, requires specialized platforms, and takes several weeks to return a result. Deep learning algorithms can extract morphological patterns from digital histopathology images to predict molecular phenotypes quickly and cost-effectively. We propose hist2RNA, a new, computationally efficient approach inspired by bulk RNA-sequencing techniques, to predict the expression of 138 genes (drawn from six commercially available molecular profiling tests), including luminal PAM50 subtype, from hematoxylin and eosin (H&E) stained whole slide images (WSIs). During training, features extracted by a pretrained model are aggregated for each patient and used to predict gene expression at the patient level, using annotated H&E images from The Cancer Genome Atlas (TCGA, n=335). We demonstrate successful gene prediction on a held-out test set (n=160, corr=0.82 across patients, corr=0.29 across genes) and perform exploratory analysis on an external tissue microarray (TMA) dataset (n=498) with known IHC and survival information. On the TMA dataset, the predicted gene expression and luminal PAM50 subtype (Luminal A versus Luminal B) are prognostic for overall survival in univariate analysis (c-index=0.56, hazard ratio=2.16, p<0.005) and remain independently significant in multivariate analysis incorporating standard clinicopathological variables (c-index=0.65, hazard ratio=1.85, p<0.005).
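The two correlation figures reported for the held-out test set measure different things, and a small sketch may help make the distinction concrete. The code below assumes one plausible reading: "across patients" correlates each patient's predicted and measured profile over the 138 genes, while "across genes" correlates each gene's predicted and measured values over the patients. The function names, the use of Pearson correlation, the median summary, and the synthetic data are all assumptions for illustration, not the authors' evaluation code.

```python
import numpy as np


def rowwise_pearson(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pearson correlation between matching rows of two 2-D arrays."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    numerator = (a * b).sum(axis=1)
    denominator = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    return numerator / denominator


def per_patient_corr(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """One value per patient: correlation computed over the genes."""
    return rowwise_pearson(y_true, y_pred)


def per_gene_corr(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """One value per gene: correlation computed over the patients."""
    return rowwise_pearson(y_true.T, y_pred.T)


# Synthetic data only (hypothetical, not TCGA): rows are patients, columns are genes.
rng = np.random.default_rng(0)
n_patients, n_genes = 160, 138

# Each gene has its own baseline level shared by all patients.
gene_baseline = rng.normal(0.0, 3.0, size=n_genes)
patient_effect = rng.normal(0.0, 1.0, size=(n_patients, n_genes))
y_true = gene_baseline + patient_effect

# A toy "model" that recovers the baseline well but the patient-specific part only weakly.
noise = rng.normal(0.0, 1.0, size=(n_patients, n_genes))
y_pred = gene_baseline + 0.3 * patient_effect + noise

patient_scores = per_patient_corr(y_true, y_pred)
gene_scores = per_gene_corr(y_true, y_pred)

print("median correlation across patients:", np.median(patient_scores))
print("median correlation across genes:", np.median(gene_scores))
```

In this toy setup the per-patient correlation is dominated by baseline differences between genes, which any reasonable predictor captures, whereas the per-gene correlation depends only on patient-to-patient variation. Under the stated reading, this is one reason the across-gene figure can be much lower than the across-patient figure for the same set of predictions.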