Testing for feature differences between clusters often yields inflated false-positive rates when practitioners use the same dataset both to identify the clusters and to test the features, an issue commonly known as ``double dipping''. To address this challenge, and inspired by data-splitting strategies for controlling the false discovery rate (FDR) in regression \parencite{daiFalseDiscoveryRate2023}, we present a novel method that applies data splitting to control the FDR while maintaining high power in unsupervised clustering. We first divide the dataset into two halves, then apply the conventional testing-after-clustering procedure to each half separately, and finally combine the two resulting test statistics into a new statistic for each feature. For any null feature, the sampling distribution of this new statistic is symmetric around zero, and this property is what enables FDR control. To further enhance stability and power, we propose multiple data splitting, which repeatedly splits the data and combines the results across splits. We prove that the proposed data-splitting methods asymptotically control the FDR in Gaussian settings. Through extensive simulations and analyses of single-cell RNA sequencing (scRNA-seq) datasets, we demonstrate that the data-splitting methods are easy to implement, adaptable to existing single-cell data analysis pipelines, and often outperform other approaches when signals are weak and features are highly correlated.
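The combining and selection steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the combining rule (sign of the product times the sum of absolute values) and the data-driven cutoff follow one common convention from the data-splitting FDR literature, the function names `mirror_statistics` and `select_features` are hypothetical, and the per-feature test statistics from the two halves are assumed to be already computed and sign-aligned (that is, the clusters found in the two halves have been matched so that the direction of each difference is comparable).

```python
def mirror_statistics(t1, t2):
    """Combine per-feature test statistics from the two data halves.

    Assumed combining rule: M_j = sign(t1_j * t2_j) * (|t1_j| + |t2_j|).
    For a null feature the two signs agree or disagree with equal
    probability, so M_j is symmetric around zero; for a true signal
    both statistics tend to share a sign and M_j is large and positive.
    """
    combined = []
    for a, b in zip(t1, t2):
        prod = a * b
        sign = (prod > 0) - (prod < 0)  # -1, 0, or +1
        combined.append(sign * (abs(a) + abs(b)))
    return combined


def select_features(mirror, q):
    """Select features whose mirror statistic exceeds a data-driven cutoff.

    The cutoff is the smallest t among the observed |M_j| for which the
    estimated false discovery proportion
        #{j : M_j <= -t} / #{j : M_j >= t}
    is at most q. The left tail estimates the number of null features in
    the right tail, which is valid because of the symmetry of null M_j.
    Returns the list of selected feature indices (empty if no cutoff works).
    """
    candidates = sorted({abs(m) for m in mirror if m != 0})
    for t in candidates:
        n_left_tail = sum(1 for m in mirror if m <= -t)
        selected = [j for j, m in enumerate(mirror) if m >= t]
        if selected and n_left_tail / len(selected) <= q:
            return selected
    return []
```

In use, `t1` and `t2` would hold the statistics produced by running the clustering and the per-feature test on each half separately, and `q` would be the target FDR level. Multiple data splitting would repeat this over many random splits and aggregate the selections; that aggregation step is not shown here.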