Network datasets appear across a wide range of scientific fields, including biology, physics, and the social sciences. To enable data-driven discoveries from these networks, statistical inference techniques such as estimation and hypothesis testing are crucial. However, modern networks are often too large for the storage and computational requirements of existing methods, making timely, statistically rigorous inference difficult. In this work, we introduce a subsampling-based approach that reduces the computational burden of estimation and two-sample hypothesis testing. Our strategy is to select a small random subset of nodes from the network, conduct inference on the induced subgraph, and then extend the result to the entire graph by interpolation, using the observed connections between the subsampled nodes and the remaining nodes. We develop the methodology under the generalized random dot product graph framework, which affords broad applicability and permits rigorous analysis. Within this setting, we establish consistency guarantees and corroborate the practical effectiveness of the approach through comprehensive simulation studies.
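The subsample, infer, and interpolate strategy for estimation can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `simulate_sbm` and `subsample_estimate` are hypothetical, the interpolation step is assumed here to be an ordinary least-squares out-of-sample extension, and the test graph is a two-block stochastic block model with an indefinite block matrix, which is one special case of a generalized random dot product graph. The embedding dimension `d` is assumed known.

```python
import numpy as np

def simulate_sbm(n, B, rng):
    """Hypothetical helper: sample a symmetric, hollow SBM adjacency matrix."""
    z = rng.integers(0, B.shape[0], size=n)        # block labels
    P = B[np.ix_(z, z)]                            # edge-probability matrix
    upper = np.triu(rng.random((n, n)) < P, k=1)   # independent edges, i < j
    A = (upper | upper.T).astype(float)
    return A, P

def subsample_estimate(A, m, d, rng):
    """Sketch: estimate P from an m-node subsample (assumed least-squares extension)."""
    n = A.shape[0]
    S = rng.choice(n, size=m, replace=False)       # random node subsample

    # Step 1: spectral embedding of the induced subgraph.
    # Keep the d eigenvalues largest in magnitude (they may be negative).
    vals, vecs = np.linalg.eigh(A[np.ix_(S, S)])
    top = np.argsort(-np.abs(vals))[:d]
    vals, vecs = vals[top], vecs[:, top]
    signs = np.sign(vals)                          # diagonal of I_{p,q}
    X_S = vecs * np.sqrt(np.abs(vals))             # m x d latent positions

    # Step 2: interpolate every node from its edges to the subsample.
    # Model: A[S, j] ~ X_S @ diag(signs) @ x_j, solved by least squares.
    Y, *_ = np.linalg.lstsq(X_S, A[S, :], rcond=None)   # d x n
    X_all = (Y * signs[:, None]).T                 # undo diag(signs)
    X_all[S] = X_S                                 # keep in-sample embedding

    # Step 3: plug-in estimate of the full probability matrix.
    P_hat = (X_all * signs) @ X_all.T
    return P_hat, S

rng = np.random.default_rng(0)
B = np.array([[0.1, 0.6], [0.6, 0.1]])             # eigenvalues 0.7 and -0.5
A, P = simulate_sbm(900, B, rng)
P_hat, S = subsample_estimate(A, m=300, d=2, rng=rng)
print("relative Frobenius error:", np.linalg.norm(P_hat - P) / np.linalg.norm(P))
```

In this sketch the only eigendecomposition is of the m × m subgraph, and the extension to the remaining nodes is a single least-squares solve against the m × n block of edges between the subsample and all nodes, which is where the computational saving would come from.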