Batch effects are inevitable in large-scale metabolomics. Before formal data analysis, batch effect correction (BEC) is applied to prevent batch effects from obscuring biological variation, and batch effect evaluation (BEE) is used to assess the correction. However, existing BEE algorithms neglect the covariances between variables, and existing BEC algorithms may fail to correct those covariances adequately. Drawing on recent advances in high-dimensional statistics, we therefore propose two methods: "quality control-based simultaneous tests (QC-ST)" for BEE and "covariance correction (CoCo)" for BEC. On simulated data, QC-ST simultaneously tests whether the mean vectors and covariance matrices of QC samples differ significantly across batches, and it performs satisfactorily in empirical size, empirical power, and computational speed. We then apply four QC-based BEC algorithms to two large cohort datasets and find that extreme gradient boosting (XGBoost) performs best in terms of relative standard deviation (RSD) and dispersion-ratio (D-ratio). If QC-ST still indicates significant batch effects between any pair of batches after this initial BEC, CoCo should be applied. After CoCo (where necessary), the four metrics (RSD, D-ratio, classification performance, and QC-ST) may improve further. In summary, guided by QC-ST, a matching strategy can be developed that integrates multiple BEC algorithms more rationally and flexibly and minimizes batch effects, supporting reliable biological conclusions.
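As a rough illustration of two of the metrics named above, the sketch below computes per-feature RSD and D-ratio. The abstract does not define either formula, so the definitions used here are assumptions based on common usage: RSD as the QC standard deviation divided by the QC mean, and D-ratio as the QC standard deviation divided by the standard deviation of the biological samples, both in percent. The function names `rsd_percent` and `d_ratio_percent` and the toy intensities are hypothetical.

```python
import numpy as np


def rsd_percent(qc):
    """Per-feature relative standard deviation (%) across QC injections.

    Assumed definition: 100 * sd(QC) / mean(QC), using the sample sd (ddof=1).
    Rows are injections, columns are features.
    """
    qc = np.asarray(qc, dtype=float)
    return 100.0 * qc.std(axis=0, ddof=1) / qc.mean(axis=0)


def d_ratio_percent(qc, bio):
    """Per-feature dispersion ratio (%).

    Assumed definition: 100 * sd(QC) / sd(biological samples), ddof=1.
    Lower values suggest technical variation is small relative to
    biological variation.
    """
    qc = np.asarray(qc, dtype=float)
    bio = np.asarray(bio, dtype=float)
    return 100.0 * qc.std(axis=0, ddof=1) / bio.std(axis=0, ddof=1)


# Toy data (hypothetical): 3 QC injections and 3 biological samples, 2 features.
qc = np.array([[100.0, 50.0], [102.0, 60.0], [98.0, 40.0]])
bio = np.array([[80.0, 30.0], [100.0, 50.0], [120.0, 70.0]])

print(rsd_percent(qc))           # per-feature RSD in percent
print(d_ratio_percent(qc, bio))  # per-feature D-ratio in percent
```

Under these assumed definitions, both metrics would be recomputed after each correction step, with lower values indicating less residual technical variation in the QC samples.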