Post-clustering inference in single-cell RNA sequencing (scRNA-seq) analysis presents significant challenges in controlling Type I error during differential expression analysis. Data fission, a promising approach that aims to split data into two independent parts, relies on strong parametric assumptions of non-mixture distributions that are inherently violated in clustered data. To address this limitation, we introduce conditional data fission, an extension designed to decompose each mixture component into two independent parts. However, we demonstrate that applying such conditional data fission to mixture distributions requires prior knowledge of the clustering structure to ensure valid post-clustering inference. This arises from the need to accurately estimate component-specific scale parameters, which are critical for performing decomposition while maintaining independence. We theoretically quantify how biases in estimating these parameters lead to inflated Type I error rates due to deviations from independence. Given that mixture components are typically unknown in practice, our results underscore the fundamental difficulty of applying data fission in real-world settings, despite its prior proposal as a solution for post-clustering inference.
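The dependence between the two fissioned parts under a misestimated scale parameter can be illustrated with a small simulation. The sketch below is a toy example under stated assumptions, not the construction analyzed in this work: it assumes a Gaussian model (scRNA-seq counts would normally call for a count distribution), a two-component mixture with means -2 and +2 and unit within-component variance, and the standard Gaussian fission X1 = X + tau*Z, X2 = X - Z/tau with Z ~ N(0, sigma_hat^2). The helper names `gaussian_fission` and `within_corr` are hypothetical. With the true within-component scale, the two parts are uncorrelated inside a component; with the pooled standard deviation, which absorbs the between-component spread, they are strongly negatively correlated.

```python
import numpy as np


def gaussian_fission(x, sigma_hat, tau=1.0, rng=None):
    """Split x into two parts by adding and subtracting Gaussian noise.

    The parts are independent only if sigma_hat equals the true
    standard deviation of x (within a single Gaussian component).
    """
    rng = np.random.default_rng() if rng is None else rng
    z = rng.normal(0.0, sigma_hat, size=x.shape)
    return x + tau * z, x - z / tau


def within_corr(a, b, labels, k):
    """Correlation between the two parts, restricted to component k."""
    mask = labels == k
    return np.corrcoef(a[mask], b[mask])[0, 1]


rng = np.random.default_rng(0)
n = 200_000
sigma = 1.0  # true within-component standard deviation (toy assumption)

# Two-component Gaussian mixture with equal weights (toy assumption).
labels = rng.integers(0, 2, size=n)
means = np.where(labels == 0, -2.0, 2.0)
x = means + rng.normal(0.0, sigma, size=n)

# (a) Oracle: fission with the true component-specific scale.
x1, x2 = gaussian_fission(x, sigma, rng=rng)

# (b) Naive: fission with the pooled scale, which ignores the clustering
#     structure and therefore overestimates the within-component scale.
sigma_pooled = x.std()
y1, y2 = gaussian_fission(x, sigma_pooled, rng=rng)

corr_oracle = within_corr(x1, x2, labels, 0)
corr_naive = within_corr(y1, y2, labels, 0)

print(f"pooled scale estimate:         {sigma_pooled:.3f}")
print(f"within-component corr, oracle: {corr_oracle:.3f}")
print(f"within-component corr, naive:  {corr_naive:.3f}")
```

In this toy setting the within-component covariance of the two parts is sigma^2 - sigma_hat^2, so any bias in the scale estimate translates directly into dependence between the part used for clustering and the part used for testing.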