Sampling is frequently used to collect data from large networks. In this article we provide valid asymptotic prediction intervals for the subgraph counts and the clustering coefficient of a population network when a network sampling scheme is used to observe the population. The theory is developed under a model-based framework, in which the population network is assumed to be generated by a Stochastic Block Model (SBM). We study the effects of induced and ego-centric network formation, following the initial selection of nodes by Bernoulli sampling, and establish asymptotic normality of the sample-based subgraph count and clustering coefficient statistics under both network formation methods. The asymptotic results are developed under a joint design- and model-based approach, in which the effect of the sampling design is not ignored. For the sample-based clustering coefficient statistic, we find that a bias correction is required in the ego-centric case, but that there is no such bias in the induced case. We also extend the asymptotic normality results for estimated subgraph counts to a mildly sparse SBM framework, where edge probabilities decay to zero at a slow rate. In this sparse setting we find that the scaling and the maximum allowable decay rate for the edge probabilities depend on the choice of the target subgraph. We obtain an expression for this maximum allowable decay rate, and our results suggest that the rate becomes slower if the target subgraph has more edges in a certain sense. The simulation results suggest that the proposed prediction intervals have excellent coverage, even when the node selection probability is small and the unknown SBM parameters are replaced by their estimates. Finally, the proposed methodology is applied to a real data set.
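To make the two network formation methods concrete, the following is a minimal Python sketch, not the estimators or prediction intervals developed in the article. It draws a population network from an SBM, selects nodes by Bernoulli sampling, and forms the induced graph (an edge is kept only if both endpoints are selected) and the ego-centric graph (an edge is kept if at least one endpoint is selected). All function names, block sizes, and probabilities are hypothetical choices made for illustration.

```python
import numpy as np


def sample_sbm(block_sizes, B, rng):
    """Draw a symmetric 0/1 adjacency matrix without self-loops from an SBM.

    block_sizes: number of nodes in each block (illustrative input).
    B: symmetric matrix of between-block edge probabilities.
    """
    labels = np.repeat(np.arange(len(block_sizes)), block_sizes)
    P = B[np.ix_(labels, labels)]                    # edge probability per node pair
    upper = np.triu(rng.random(P.shape) < P, k=1)    # independent draws for i < j
    return (upper | upper.T).astype(int)


def bernoulli_nodes(n, p, rng):
    """Select each of n nodes independently with probability p."""
    return rng.random(n) < p


def induced_adjacency(A, selected):
    """Keep an edge only if both of its endpoints were selected."""
    both = selected[:, None] & selected[None, :]
    return A * both


def ego_adjacency(A, selected):
    """Keep an edge if at least one of its endpoints was selected."""
    either = selected[:, None] | selected[None, :]
    return A * either


def edge_count(A):
    """Number of undirected edges."""
    return int(A.sum() // 2)


def triangle_count(A):
    """Number of triangles, computed as trace(A^3) / 6."""
    return int(np.trace(A @ A @ A) // 6)


if __name__ == "__main__":
    rng = np.random.default_rng(0)                   # arbitrary seed
    B = np.array([[0.30, 0.05],
                  [0.05, 0.20]])                     # assumed block probabilities
    A = sample_sbm([150, 150], B, rng)
    selected = bernoulli_nodes(A.shape[0], 0.2, rng) # assumed selection probability
    for name, G in [("population", A),
                    ("induced", induced_adjacency(A, selected)),
                    ("ego-centric", ego_adjacency(A, selected))]:
        print(name, edge_count(G), triangle_count(G))
```

Under Bernoulli sampling with selection probability p, a given population edge appears in the induced graph with probability p^2 and in the ego-centric graph with probability 1 - (1 - p)^2, so the two formation methods observe very different fractions of the population and need different scalings.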