Gene expression profiling can subtype breast cancer and predicts risk of recurrence and treatment responsiveness better than routine immunohistochemistry (IHC). In the clinic, however, molecular profiling is used primarily for ER+ cancer; it is costly, destroys tissue, requires specialized platforms, and takes several weeks to return a result. Deep learning algorithms can extract morphological patterns from digital histopathology images to predict molecular phenotypes quickly and cost-effectively. We propose hist2RNA, a new, computationally efficient approach inspired by bulk RNA-sequencing techniques, to predict the expression of 138 genes (drawn from six commercially available molecular profiling tests), including luminal PAM50 subtype, from hematoxylin and eosin (H&E) stained whole slide images (WSIs). During training, features extracted by a pretrained model are aggregated for each patient and used to predict gene expression at the patient level, using annotated H&E images from The Cancer Genome Atlas (TCGA, n=335). We demonstrate successful gene prediction on a held-out test set (n=160, corr=0.82 across patients, corr=0.29 across genes) and perform exploratory analysis on an external tissue microarray (TMA) dataset (n=498) with known IHC and survival information. On the TMA dataset, the model predicts gene expression and luminal PAM50 subtype (Luminal A versus Luminal B), and the predicted subtype is prognostic for overall survival in univariate analysis (c-index=0.56, hazard ratio=2.16 (95% CI 1.12-3.06), p<5×10⁻³) and remains independently significant in multivariate analysis incorporating standard clinicopathological variables (c-index=0.65, hazard ratio=1.85 (95% CI 1.30-2.68), p<5×10⁻³).
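The abstract reports two correlation figures for the held-out test set, one "across patients" and one "across genes". The sketch below illustrates one plausible reading of that distinction; it is not the authors' evaluation code. It assumes that "across patients" means correlating predicted and measured expression over the 138 genes within each patient, that "across genes" means correlating over the patients within each gene, and that each set is summarized by its median. The use of Pearson correlation, the function names `rowwise_pearson` and `summarize_correlations`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def rowwise_pearson(a, b):
    """Pearson correlation between matching rows of two 2-D arrays."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    denom = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    return (a * b).sum(axis=1) / denom


def summarize_correlations(y_true, y_pred):
    """Return (median per-patient corr, median per-gene corr).

    y_true, y_pred: arrays of shape (n_patients, n_genes).
    """
    # One correlation per patient, computed over that patient's genes.
    per_patient = rowwise_pearson(y_true, y_pred)
    # One correlation per gene, computed over all patients.
    per_gene = rowwise_pearson(y_true.T, y_pred.T)
    return float(np.median(per_patient)), float(np.median(per_gene))


# Synthetic data (not from the paper): the "prediction" recovers each gene's
# typical expression level but carries no patient-specific signal.
rng = np.random.default_rng(0)
n_patients, n_genes = 160, 138
gene_means = rng.normal(0.0, 3.0, size=n_genes)
y_true = gene_means + rng.normal(0.0, 1.0, size=(n_patients, n_genes))
y_pred = gene_means + rng.normal(0.0, 1.0, size=(n_patients, n_genes))

across_patients, across_genes = summarize_correlations(y_true, y_pred)
print(f"median corr across patients: {across_patients:.2f}")
print(f"median corr across genes:    {across_genes:.2f}")
```

In this synthetic setup the per-patient figure is high and the per-gene figure is close to zero, because the first rewards getting the relative level of each gene right while the second rewards tracking patient-to-patient variation within a gene. Under the assumed reading, the two numbers therefore measure different things and are not directly comparable.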