Post-clustering inference in scRNA-seq analysis makes it difficult to control the Type I error rate in differential expression analysis, because the same data are used both to define the clusters and to test for differences between them. Data fission, a promising approach, splits the data into two new independent parts, but it relies on strong parametric assumptions of non-mixture distributions, which are violated in clustered data. We show that applying data fission to such mixtures requires knowledge of the clustering structure in order to accurately estimate the component-specific scale parameters. These estimates are critical for ensuring both the decomposition and the independence of the two parts. We theoretically quantify the direct impact of bias in estimating these scale parameters on the inflation of the Type I error rate, which is caused by the resulting departure from independence. Since component structures are unknown in practice, we propose a heteroscedastic model with non-parametric estimators of the individual scale parameters. This model uses the proximity between observations to capture the effect of the underlying mixture on data dispersion. While this approach works well when clusters are well separated, it introduces bias when separation is weak, highlighting the difficulty of applying data fission in real-world scenarios where the degree of separation is unknown.
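The dependence on component-specific scale parameters can be illustrated with a minimal sketch, assuming a univariate Gaussian mixture rather than actual scRNA-seq counts. The function name `gaussian_fission`, the tuning parameter `tau`, and the simulated means and sample size are hypothetical choices made for illustration only. For X ~ N(mu, sigma^2) and Z ~ N(0, sigma^2), the parts X + tau Z and X - Z / tau are independent only when the sigma used to draw Z matches the true within-component scale. Plugging in the global standard deviation of a mixture, which is inflated by the between-cluster spread, leaves the two parts strongly correlated within each component.

```python
import numpy as np


def gaussian_fission(x, sigma, tau=1.0, rng=None):
    """Split x into two parts using Gaussian data fission.

    Hypothetical helper: the parts are independent only if `sigma`
    equals the true standard deviation of the noise in x.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = rng.normal(0.0, sigma, size=x.shape)  # auxiliary noise
    return x + tau * z, x - z / tau


def within_corr(a, b, labels, k):
    """Correlation between the two parts inside component k."""
    mask = labels == k
    return np.corrcoef(a[mask], b[mask])[0, 1]


rng = np.random.default_rng(0)
n = 20000
labels = rng.integers(0, 2, size=n)           # latent component labels
means = np.array([-3.0, 3.0])                 # assumed component means
x = means[labels] + rng.normal(0.0, 1.0, size=n)  # true within-component sd = 1

# Oracle fission: uses the true component-specific scale.
x1, x2 = gaussian_fission(x, sigma=1.0, rng=rng)
corr_oracle = within_corr(x1, x2, labels, 0)

# Naive fission: uses the global sd, which ignores the mixture structure.
x1n, x2n = gaussian_fission(x, sigma=x.std(), rng=rng)
corr_naive = within_corr(x1n, x2n, labels, 0)

print(f"within-component correlation, oracle sigma: {corr_oracle:.3f}")
print(f"within-component correlation, global sigma: {corr_naive:.3f}")
```

Under these assumptions the oracle split gives a within-component correlation close to zero, whereas the naive split gives a clearly negative one, since the covariance of the two parts within a component equals the true variance minus the plugged-in variance.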