Normalization and selecting non-differentially expressed genes improve machine learning modelling of cross-platform transcriptomic data

Normalization is a critical step in quantitative analyses of biological processes. Recent work shows that cross-platform integration and normalization enable machine learning (ML) training on RNA microarray and RNA-seq data, but those studies did not use independent datasets. It is therefore unclear how to improve ML modelling performance on independent microarray-based and RNA-seq-based datasets. Inspired by the housekeeping genes commonly used in experimental biology, this study tests the hypothesis that non-differentially expressed genes (NDEG) may improve normalization of transcriptomic data and, in turn, the cross-platform performance of ML models. Microarray and RNA-seq datasets from the TCGA breast cancer project were used as independent training and test datasets, respectively, to classify the molecular subtypes of breast cancer. NDEG (p > 0.85) and differentially expressed genes (DEG, p < 0.05) were selected based on ANOVA p values and used for data normalization and classification, respectively. Models trained on data from one platform were tested on data from the other platform. Our data show that NDEG and DEG selection effectively improved classification performance. Normalization methods based on parametric statistics were inferior to those based on nonparametric statistics. In this study, the LOG_QN and LOG_QNZ normalization methods combined with a neural network classifier appeared to achieve better performance. NDEG-based normalization therefore appears useful for cross-platform testing on completely independent datasets. However, further studies are required to determine whether NDEG-based normalization improves ML classification performance in other datasets and other omics data types.
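The gene selection step described above can be illustrated with a minimal sketch. It assumes a samples-by-genes expression matrix and a one-way ANOVA across subtype labels, computed with `scipy.stats.f_oneway`. The function name, argument names, and data layout are hypothetical, and the abstract does not state which software the authors used or how the selected NDEG enter the normalization step, so only the p-value thresholding is shown.

```python
import numpy as np
from scipy.stats import f_oneway


def select_genes_by_anova(expr, labels, ndeg_p=0.85, deg_p=0.05):
    """Split genes into NDEG and DEG using one-way ANOVA p values.

    expr   : array of shape (n_samples, n_genes), expression values
    labels : array of shape (n_samples,), one class label per sample
             (assumed here to be the molecular subtype)
    Returns (ndeg_idx, deg_idx, pvals), where the index arrays refer
    to columns of expr.
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)

    # One sub-matrix of samples per class.
    groups = [expr[labels == g] for g in np.unique(labels)]

    # With 2-D inputs, f_oneway runs one ANOVA per column (gene).
    _, pvals = f_oneway(*groups)

    # Thresholds taken from the abstract: NDEG p > 0.85, DEG p < 0.05.
    # Genes with undefined p values (NaN) fall into neither set.
    ndeg_idx = np.flatnonzero(pvals > ndeg_p)
    deg_idx = np.flatnonzero(pvals < deg_p)
    return ndeg_idx, deg_idx, pvals
```

In a cross-platform setting, one plausible design (again an assumption, not something the abstract specifies) is to compute both gene sets on the training platform only, so that the test platform remains fully independent.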
