Integrating data from different platforms, such as bulk and single-cell RNA sequencing, is crucial for improving the accuracy and interpretability of complex biological analyses like cell type deconvolution. However, this task is complicated by measurement and biological heterogeneity between target and reference datasets. In cell type deconvolution, existing methods often neglect the correlation and uncertainty in cell type proportion estimates, which can lead to false positives in downstream comparisons across multiple individuals. We introduce MEAD, a comprehensive statistical framework that not only estimates cell type proportions but also provides asymptotically valid statistical inference on the estimates. A key contribution is an identifiability result that rigorously establishes the conditions under which cell type proportions are identifiable despite arbitrary heterogeneity of measurement biases between platforms. MEAD also supports the comparison of cell type proportions across individuals after deconvolution, accounting for gene-gene correlations and biological variability. In simulations and real-data analysis, MEAD demonstrates superior reliability for inferring cell type compositions in complex biological systems.