hist2RNA: An efficient deep learning architecture to predict gene expression from breast cancer histopathology images

Gene expression can be used to subtype breast cancer, with better prediction of recurrence risk and treatment responsiveness than routine immunohistochemistry (IHC) provides. In the clinic, however, molecular profiling is used primarily for ER+ cancer; it is costly and tissue-destructive, requires specialized platforms, and takes several weeks to return a result. Deep learning algorithms can effectively extract morphological patterns from digital histopathology images to predict molecular phenotypes quickly and cost-effectively. We propose a new, computationally efficient approach called hist2RNA, inspired by bulk RNA-sequencing techniques, to predict the expression of 138 genes (drawn from six commercially available molecular profiling tests), including the luminal PAM50 subtype, from hematoxylin and eosin (H&E) stained whole slide images (WSIs). In the training phase, features extracted by a pretrained model are aggregated for each patient and used to predict gene expression at the patient level, using annotated H&E images from The Cancer Genome Atlas (TCGA, n = 335). We demonstrate successful gene prediction on a held-out test set (n = 160, corr = 0.82 across patients, corr = 0.29 across genes) and perform exploratory analysis on an external tissue microarray (TMA) dataset (n = 498) with known IHC and survival information. Our model predicts gene expression and luminal PAM50 subtype (Luminal A versus Luminal B) on the TMA dataset with prognostic significance for overall survival in univariate analysis (c-index = 0.56, hazard ratio = 2.16 (95% CI 1.12-3.06), p < 5 × 10⁻³) and independent significance in multivariate analysis incorporating standard clinicopathological variables (c-index = 0.65, hazard ratio = 1.85 (95% CI 1.30-2.68), p < 5 × 10⁻³).
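The patient-level aggregation and the two correlation summaries mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state how features are aggregated or how the correlations are computed, so mean pooling, Pearson correlation, the median summary, and all function names below are assumptions made for illustration. One plausible reading, used here, is that "across patients" refers to one correlation per patient computed over genes, and "across genes" to one correlation per gene computed over patients.

```python
import numpy as np


def aggregate_patient_features(tile_features, patient_ids):
    """Mean-pool tile-level feature vectors into one vector per patient.

    Assumption: mean pooling; the abstract only says features are "aggregated".
    """
    tile_features = np.asarray(tile_features, dtype=float)
    patient_ids = np.asarray(patient_ids)
    unique_ids = np.unique(patient_ids)
    pooled = np.stack(
        [tile_features[patient_ids == pid].mean(axis=0) for pid in unique_ids]
    )
    return unique_ids, pooled


def rowwise_pearson(a, b):
    """Pearson correlation between matching rows of two 2-D arrays."""
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    num = (a_c * b_c).sum(axis=1)
    den = np.sqrt((a_c ** 2).sum(axis=1) * (b_c ** 2).sum(axis=1))
    return num / den


def correlation_summary(y_true, y_pred):
    """Summarize agreement for matrices of shape (n_patients, n_genes).

    Returns the median per-patient correlation (computed over genes) and the
    median per-gene correlation (computed over patients). Both the median and
    this axis convention are assumptions, not taken from the abstract.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    per_patient = rowwise_pearson(y_true, y_pred)
    per_gene = rowwise_pearson(y_true.T, y_pred.T)
    return float(np.median(per_patient)), float(np.median(per_gene))
```

In practice the pooled patient vectors would feed a regression model with one output per gene, and `correlation_summary` would be applied to its predictions on the held-out set. The two summaries can differ substantially, as the reported values do, because a per-patient correlation rewards getting the overall ranking of genes right, while a per-gene correlation requires tracking differences between patients for each individual gene.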
