The Wreaths of KHAN: Uniform Graph Feature Selection with False Discovery Rate Control

Graphical models find numerous applications in biology, chemistry, sociology, neuroscience, and other fields. While substantial progress has been made in graph estimation, it remains largely unexplored how to select significant graph signals with uncertainty assessment, especially graph features related to topological structures such as cycles (i.e., wreaths), cliques, and hubs. These features play a vital role in protein substructure analysis, drug molecular design, and brain network connectivity analysis. To fill this gap, we propose a novel inferential framework for general high-dimensional graphical models that selects graph features while controlling the false discovery rate. Our method is based on the maximum of the $p$-values of the single edges that comprise the topological feature of interest, and is thus able to detect weak signals. Moreover, we introduce the $K$-dimensional persistent Homology Adaptive selectioN (KHAN) algorithm, which selects all homological features within $K$ dimensions with uniform control of the false discovery rate over continuous filtration levels. The KHAN method applies a novel discrete Gram-Schmidt algorithm to select statistically significant generators from the homology group. We apply the structural screening method to identify the important residues of the SARS-CoV-2 spike protein during its binding to the ACE2 receptor. We score the residues of all domains in the spike protein by the $p$-value weighted filtration level in the network persistent homology for the closed, partially open, and open states, and identify the residues that are crucial for protein conformational changes and are therefore potential targets for inhibition.
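As a rough illustration of the max-of-edge-$p$-values idea, the sketch below scores each candidate feature by the largest $p$-value among its edges and then applies the standard Benjamini-Hochberg step-up rule. This is only a minimal sketch under stated assumptions: the abstract does not specify the selection rule, so Benjamini-Hochberg is swapped in here as a familiar stand-in and is not the KHAN procedure; the function names, the edge $p$-values, and the target level `q` are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple

Edge = Tuple[str, str]


def feature_p_value(edges: Sequence[Edge],
                    edge_p: Dict[FrozenSet[str], float]) -> float:
    """Score a feature by the maximum p-value over the edges that form it."""
    return max(edge_p[frozenset(e)] for e in edges)


def benjamini_hochberg(p_values: Sequence[float], q: float) -> List[int]:
    """Return indices selected by the standard BH step-up rule at level q.

    Stand-in only: this is not the selection rule of the paper.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value passes its BH threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])


# Hypothetical single-edge p-values (not taken from the paper).
edge_p = {
    frozenset(("a", "b")): 0.001,
    frozenset(("b", "c")): 0.004,
    frozenset(("a", "c")): 0.002,
    frozenset(("b", "d")): 0.300,
    frozenset(("a", "d")): 0.020,
}

# Two candidate cycles, each listed by its edges.
features = [
    [("a", "b"), ("b", "c"), ("a", "c")],
    [("a", "b"), ("b", "d"), ("a", "d")],
]

feature_p = [feature_p_value(f, edge_p) for f in features]
selected = benjamini_hochberg(feature_p, q=0.1)
```

With these made-up numbers, the first cycle is scored by its weakest edge and is the only one selected, while the second is held back by the single edge with $p$-value 0.3. Taking the maximum means a feature is declared present only when every one of its edges is individually supported.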
