Epidemiologists and social scientists have used the Network Scale-Up Method (NSUM) for over thirty years to estimate the size of a hidden sub-population within a social network. The method queries a subset of network nodes about the number of their neighbors that belong to the hidden sub-population. NSUM generally assumes that the social network topology and the distribution of the hidden sub-population are well behaved, so that the NSUM estimate is close to the actual value. However, bounds on NSUM estimation errors have not been proven analytically. This paper provides analytical bounds on the error incurred by the two most popular NSUM estimators. These bounds assume that each queried node accurately reports its degree and the number of its neighbors that belong to the hidden sub-population. Our key findings are twofold. First, we show that when an adversary designs the network and places the hidden sub-population, the estimate can be a factor of $\Omega(\sqrt{n})$ off from the real value (in a network with $n$ nodes). Second, we prove error bounds for the case in which the underlying network is randomly generated, showing that an error of a small constant factor can be achieved with high probability using samples of logarithmic size $O(\log{n})$. We present improved analytical bounds for Erdős-Rényi and scale-free networks. Our theoretical analysis is supported by an extensive set of numerical experiments designed to determine the effect of the sample size on the accuracy of the estimates in both synthetic and real networks.
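The querying procedure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not name the two estimators it analyzes, so the two forms below (a ratio of sums and a mean of ratios, both common in the NSUM literature) are an assumption, as are all function names, the uniform placement of the hidden sub-population, and the parameter values. The sketch follows the abstract's assumption that queried nodes truthfully report their degree and their number of hidden neighbors.

```python
import random


def erdos_renyi(n, p, rng):
    """Return adjacency sets for an undirected G(n, p) random graph."""
    adj = [set() for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def query(adj, hidden, node):
    """Truthful response of a node: (degree, number of hidden neighbors)."""
    nbrs = adj[node]
    return len(nbrs), sum(1 for v in nbrs if v in hidden)


def ratio_of_sums(n, responses):
    """Assumed estimator 1: n * (sum of hidden counts) / (sum of degrees)."""
    total_deg = sum(d for d, _ in responses)
    total_hidden = sum(h for _, h in responses)
    return n * total_hidden / total_deg if total_deg else 0.0


def mean_of_ratios(n, responses):
    """Assumed estimator 2: n * average of h/d over respondents with d > 0."""
    ratios = [h / d for d, h in responses if d > 0]
    return n * sum(ratios) / len(ratios) if ratios else 0.0


def demo(n=400, p=0.05, hidden_fraction=0.1, sample_size=30, seed=0):
    """Hypothetical experiment: random graph, uniformly placed hidden set."""
    rng = random.Random(seed)
    adj = erdos_renyi(n, p, rng)
    hidden = set(rng.sample(range(n), int(hidden_fraction * n)))
    sample = rng.sample(range(n), sample_size)
    responses = [query(adj, hidden, v) for v in sample]
    return len(hidden), ratio_of_sums(n, responses), mean_of_ratios(n, responses)


if __name__ == "__main__":
    true_size, ros, mor = demo()
    print(f"true={true_size} ratio_of_sums={ros:.1f} mean_of_ratios={mor:.1f}")
```

Calling `demo` with different `sample_size` values gives a rough feel for how the sample size affects accuracy on a random graph. It says nothing about the adversarial case, where the network and the placement of the hidden sub-population are chosen to mislead the estimators.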