Detecting and Correcting Sample-by-Sample Scale Distortion in RNA Sequencing Data

RNA sequencing (RNA-seq) is the conventional genome-scale approach for capturing the expression levels of all detectable genes in a biological sample. It is now regularly used in population-based studies designed to identify genetic determinants of various diseases, so the accuracy of these tests should be verified and, where possible, improved. In this study, we aimed to detect and correct expression-level-dependent errors that vary from sample to sample and are not corrected by conventional normalization techniques. We examined several RNA-seq datasets from the Cancer Genome Atlas (TCGA), Stand Up 2 Cancer (SU2C), and GTEx databases with various types of preprocessing. By applying local averaging, we found sample-by-sample, expression-level-dependent biases in all datasets studied. Using simulations, we show that these biases corrupt gene-gene correlation estimates and $t$-tests between subpopulations. To mitigate these biases, we introduce two nonlinear transforms based on statistical considerations. We demonstrate that these transforms effectively remove the observed per-sample biases, reduce sample-to-sample variance, and improve the characteristics of gene-gene correlation distributions. Using a novel simulation methodology that creates controlled differences between subpopulations, we show that these transforms reduce variability and increase the sensitivity of two-population tests. After bias correction, sensitivity and specificity improved by roughly 3-5\% in most instances. Altogether, these results improve our capacity to understand gene-gene relationships and may lead to novel ways to utilize the information derived from clinical tests.
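To make the idea of detecting bias by local averaging concrete, the following is a minimal sketch on simulated data, not the procedure used in this study. It assumes log-scale expression values, a per-gene median across samples as the reference profile, and a simple moving average over genes ordered by reference expression; the function name `local_average_bias`, the window size, and the simulated quadratic distortion are all hypothetical choices made for illustration.

```python
import numpy as np

def local_average_bias(expr, window=201):
    """Estimate per-sample, expression-level-dependent bias by local averaging.

    expr: (genes x samples) array of log-scale expression values.
    Returns the locally averaged reference level and, for each sample,
    the locally averaged residual from the reference.
    """
    # Assumed reference profile: per-gene median across samples.
    ref = np.median(expr, axis=1)
    order = np.argsort(ref)                  # order genes by reference expression
    resid = expr[order] - ref[order, None]   # per-sample deviation from reference
    kernel = np.ones(window) / window        # simple moving-average kernel
    curves = np.column_stack([
        np.convolve(resid[:, j], kernel, mode="valid")
        for j in range(expr.shape[1])
    ])
    ref_level = np.convolve(ref[order], kernel, mode="valid")
    return ref_level, curves

# Simulated data: 5000 genes, 20 samples, Gaussian noise on the log scale.
rng = np.random.default_rng(0)
n_genes, n_samples = 5000, 20
true_level = rng.uniform(0.0, 10.0, n_genes)
expr = true_level[:, None] + rng.normal(0.0, 0.3, (n_genes, n_samples))

# Inject a hypothetical expression-dependent distortion into sample 0 only.
expr[:, 0] += 0.02 * (true_level - 5.0) ** 2

ref_level, curves = local_average_bias(expr)
spread = curves.max(axis=0) - curves.min(axis=0)
print("bias-curve spread, distorted sample:", round(float(spread[0]), 3))
print("largest spread among other samples:", round(float(spread[1:].max()), 3))
```

In this sketch, a sample whose local-average curve departs clearly from zero across the expression range is flagged as carrying an expression-level-dependent bias, while unbiased samples yield curves that stay close to zero. Any correction, such as the nonlinear transforms described above, would be a separate step.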
