We introduce DiffKnock, a diffusion-based knockoff framework for high-dimensional feature selection with finite-sample false discovery rate (FDR) control. DiffKnock addresses two key limitations of existing knockoff methods: preserving complex feature dependencies and detecting non-linear associations. Our approach trains diffusion models to generate valid knockoffs and uses neural network--based gradient and filter statistics to construct antisymmetric feature importance measures. In simulations, DiffKnock achieves higher power than autoencoder-based knockoffs while maintaining the target FDR, indicating an advantage in settings with complex non-linear structure. Applied to murine single-cell RNA-seq data from LPS-stimulated macrophages, DiffKnock identifies canonical NF-$\kappa$B target genes (Ccl3, Hmox1) and regulators (Fosb, Pdgfb). These results show that, by combining the flexibility of deep generative models with rigorous statistical guarantees, DiffKnock is a powerful and reliable tool for analyzing single-cell RNA-seq data, as well as high-dimensional, structured data in other domains.
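To make the role of the antisymmetric importance measures concrete, the following is a minimal sketch of the standard knockoff+ selection step from the general knockoff filter, which turns a vector of statistics $W_j$ into a selected feature set at a target FDR level $q$. It assumes that DiffKnock feeds its statistics into this standard threshold; the abstract does not spell this out. The function names `knockoff_plus_threshold` and `knockoff_select` are hypothetical, and the diffusion-based knockoff generation and the neural network--based statistics themselves are not shown.

```python
import numpy as np


def knockoff_plus_threshold(W, q=0.1):
    """Return the knockoff+ threshold for statistics W at target FDR q.

    The threshold is the smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    Returns np.inf if no such t exists (nothing is selected).
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the nonzero magnitudes, in increasing order.
    candidates = np.sort(np.abs(W[W != 0]))
    for t in candidates:
        # Negative statistics act as a proxy count of false positives.
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf


def knockoff_select(W, q=0.1):
    """Indices of features whose statistic is at or above the threshold."""
    W = np.asarray(W, dtype=float)
    t = knockoff_plus_threshold(W, q)
    return np.flatnonzero(W >= t)


# Illustrative (made-up) statistics: large positive values favor the original
# feature over its knockoff; negative values favor the knockoff.
W_example = [5.0, 4.0, 3.0, 2.5, 2.0, 1.5, -1.0, 0.5, 0.2, -0.1]
selected = knockoff_select(W_example, q=0.2)
```

Antisymmetry matters here because swapping a feature with its knockoff must flip the sign of $W_j$; for null features the signs are then symmetric, which is what allows the count of negative statistics to estimate the number of false positives.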