This article develops limit laws for network-sampling-based estimates of subgraph counts and the clustering coefficient of a large population network, and uses them for predictive inference. A model-based approach is used, in which the population network is assumed to be generated from a sparse Stochastic Block Model (SBM). To quantify the effects of node sampling under resource constraints, a sparse Bernoulli node sampling scheme is introduced, where the node selection probability decays to zero as the population size increases. Both induced and ego-centric network formation approaches are explored. Quantitative bounds on the speed of normal approximation for estimated subgraph counts are obtained in a joint model- and design-based asymptotic framework. These bounds show that inference accuracy depends on model sparsity, sampling sparsity, and features of the target subgraph such as its edge density and minimum vertex cover size. We find that the ego-centric approach can handle higher sparsity levels in both the model and the sampling scheme than the induced approach. We also show that if model sparsity remains below a threshold, inference quality is unaffected; beyond it, the quality degrades rapidly. The sufficient conditions for obtaining a Gaussian limit law also turn out to be necessary. For strictly balanced target subgraphs, we obtain sharp transitions from Gaussian to Poisson-based limit laws as sparsity levels increase. A complete description of limit laws for estimated subgraph counts is given for the induced case, with a near-complete one for the ego-centric case. These results also yield Gaussian and Poisson limit laws for the estimated clustering coefficient. Simulations support the theory across sparsity levels, and the proposed methodology is applied to a real data set.
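To make the sampling setup concrete, the following is a minimal Python sketch of Bernoulli node sampling from an SBM, with triangle counts estimated from the induced and the ego-centric observed networks. It is an illustration under stated assumptions, not the article's estimator: the function names, the block matrix, and the parameter values are hypothetical, "ego-centric" is assumed here to mean that every edge incident to a sampled node is observed, and the estimates are rescaled by the probability that a triangle is fully observed (p^3 in the induced case, 3p^2 - 2p^3 in the ego-centric case, since a triangle needs at least two sampled vertices to have all three edges observed).

```python
import numpy as np


def sample_sbm(labels, block_probs, rng):
    """Draw a symmetric 0/1 adjacency matrix (no self-loops) from an SBM."""
    n = len(labels)
    probs = block_probs[np.ix_(labels, labels)]  # n x n edge probabilities
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    return (upper | upper.T).astype(np.int64)


def count_triangles(adj):
    """Number of triangles in an undirected simple graph: trace(A^3) / 6."""
    return int(np.trace(adj @ adj @ adj) // 6)


def induced_estimate(adj, sampled, p):
    """Count triangles among sampled nodes only, rescaled by p^3."""
    sub = adj[np.ix_(sampled, sampled)]
    return count_triangles(sub) / p**3


def ego_estimate(adj, sampled, p):
    """Count triangles using every edge with a sampled endpoint (assumed
    ego-centric design), rescaled by P(at least 2 of 3 vertices sampled)."""
    mask = sampled[:, None] | sampled[None, :]
    observed = adj * mask
    return count_triangles(observed) / (3 * p**2 - 2 * p**3)


def demo(n=300, rho=0.2, p=0.3, seed=0):
    """Compare both estimates with the population triangle count."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)                    # two blocks (assumption)
    block_probs = rho * np.array([[1.0, 0.3], [0.3, 0.8]])  # hypothetical values
    adj = sample_sbm(labels, block_probs, rng)
    sampled = rng.random(n) < p                            # Bernoulli node sampling
    return (count_triangles(adj),
            induced_estimate(adj, sampled, p),
            ego_estimate(adj, sampled, p))


if __name__ == "__main__":
    true_count, induced, ego = demo()
    print(f"population: {true_count}, induced: {induced:.1f}, ego-centric: {ego:.1f}")
```

Lowering `rho` (model sparsity) or `p` (sampling sparsity) in `demo` gives a rough feel for how the two designs behave as the network and the sample thin out; the ego-centric design observes more edges per sampled node, which is consistent with the abstract's claim that it tolerates higher sparsity.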