Given the ubiquity of modularity in biological systems, module-level regulation analysis is vital for understanding these systems and their dynamics across multiple levels. Current statistical analyses of biological modules focus predominantly on either detecting functional modules in biological networks or performing subgroup regression on biological features without using network data. This paper proposes a novel network-based neighborhood regression framework whose regression functions depend on both global community-level information and the local connectivity structure among entities. An efficient community-wise least-squares optimization approach is developed to uncover the strength of regulation among network modules while enabling asymptotic inference. Using random graph theory, we derive non-asymptotic estimation error bounds for the proposed estimator, which achieves exact minimax optimality. Unlike the root-n consistency typical of canonical linear regression, our model exhibits linear consistency in the number of nodes n, highlighting the advantage of incorporating neighborhood information. The effectiveness of the proposed framework is further supported by extensive numerical experiments. An application to whole-exome sequencing and RNA-sequencing autism datasets demonstrates the use of the proposed method for identifying associations between gene modules of genetic variation and gene modules of genomic differential expression.
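The abstract does not spell out the model, so the following is only a toy sketch of what a community-wise least-squares fit on neighborhood features could look like, not the paper's estimator. It assumes a hypothetical model in which the response of node i in community k is a linear combination, with coefficients `B[k, l]`, of the summed covariates of its neighbors in each community l, plus Gaussian noise. The graph is drawn from a simple stochastic block model with known labels. All function names, parameter values, and the model form are illustrative assumptions.

```python
import numpy as np


def simulate_sbm(n, K, p_in, p_out, rng):
    """Draw an undirected stochastic block model graph with balanced, known labels."""
    labels = np.arange(n) % K
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p_in, p_out)
    upper = np.triu(rng.random((n, n)) < probs, k=1)  # strict upper triangle: no self-loops
    A = (upper | upper.T).astype(float)               # symmetrize
    return A, labels


def neighborhood_features(A, labels, x, K):
    """Z[i, l] = sum of x[j] over neighbors j of node i that belong to community l."""
    membership = np.eye(K)[labels]                    # n x K one-hot membership matrix
    return A @ (x[:, None] * membership)


def communitywise_ls(A, labels, x, y, K):
    """Fit one ordinary least-squares problem per community (assumed toy model)."""
    Z = neighborhood_features(A, labels, x, K)
    B_hat = np.zeros((K, K))
    for k in range(K):
        rows = labels == k                            # nodes in community k
        B_hat[k] = np.linalg.lstsq(Z[rows], y[rows], rcond=None)[0]
    return B_hat


def demo(n=600, noise_sd=0.5, seed=0):
    """Simulate from the assumed model and recover the module-level coefficients."""
    rng = np.random.default_rng(seed)
    # Hypothetical regulation strengths: B_true[k, l] is the effect of community l on community k.
    B_true = np.array([[1.0, 0.5, 0.0],
                       [0.0, -1.0, 0.3],
                       [0.2, 0.0, 0.8]])
    K = B_true.shape[0]
    A, labels = simulate_sbm(n, K, p_in=0.3, p_out=0.1, rng=rng)
    x = rng.normal(size=n)
    Z = neighborhood_features(A, labels, x, K)
    y = np.sum(B_true[labels] * Z, axis=1) + noise_sd * rng.normal(size=n)
    return B_true, communitywise_ls(A, labels, x, y, K)


if __name__ == "__main__":
    B_true, B_hat = demo()
    print("max abs error:", np.max(np.abs(B_hat - B_true)))
```

Splitting the fit by community keeps each subproblem small (K coefficients per community), which is one plausible reading of "community-wise" optimization. In practice the community labels would have to be estimated from the network rather than assumed known as they are here.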