The rapid growth of online network platforms generates large-scale network data, which poses great challenges for statistical analysis using the spatial autoregression (SAR) model. In this work, we develop a novel distributed estimation and statistical inference framework for the SAR model on a distributed system. We first propose a distributed network least squares approximation (DNLSA) method, which yields a one-step estimator as a weighted average of the local estimators computed on each worker. We then design a refined two-step estimator to further reduce the estimation bias. For statistical inference, we use a random projection method to reduce the otherwise expensive communication cost. Theoretically, we establish the consistency and asymptotic normality of both the one-step and two-step estimators. In addition, we provide theoretical guarantees for the distributed statistical inference procedure. The theoretical findings and computational advantages are validated by several numerical simulations implemented on the Spark system. Lastly, an experiment on the Yelp dataset further illustrates the usefulness of the proposed methodology.
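To make the idea of a one-step estimator formed as a weighted average of local estimators concrete, the following is a minimal sketch under stated assumptions, not the DNLSA method itself. The abstract does not specify the weights, so the sketch assumes each worker reports a local estimate together with a local precision-type matrix, and the master combines them as $(\sum_k A_k)^{-1} \sum_k A_k \hat\theta_k$. Ordinary linear regression stands in for the SAR model to keep the example short, and the names `local_ols` and `combine_local_estimators` are hypothetical.

```python
import numpy as np

def local_ols(X, y):
    """Local step on one worker (assumption: plain OLS, not the SAR model).

    Returns the local estimate and the local matrix X'X used as its weight.
    """
    A = X.T @ X
    theta = np.linalg.solve(A, X.T @ y)
    return theta, A

def combine_local_estimators(thetas, weights):
    """Master step: matrix-weighted average (sum_k A_k)^{-1} sum_k A_k theta_k."""
    total_weight = np.sum(weights, axis=0)
    weighted_sum = np.sum([A @ t for A, t in zip(weights, thetas)], axis=0)
    return np.linalg.solve(total_weight, weighted_sum)

# Simulated data, split across three hypothetical workers.
rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(600, 3))
y = X @ true_theta + rng.normal(scale=0.1, size=600)

parts = np.array_split(np.arange(600), 3)
local_results = [local_ols(X[idx], y[idx]) for idx in parts]
thetas = [theta for theta, _ in local_results]
weights = [A for _, A in local_results]

# Only the local estimates and weight matrices are communicated.
theta_combined = combine_local_estimators(thetas, weights)

# Reference: the estimate computed on the pooled data.
theta_global = np.linalg.solve(X.T @ X, X.T @ y)
```

In this linear stand-in the combined estimate reproduces the pooled-data estimate up to floating-point error, because the weighted local estimates sum to `X.T @ y`. For a model such as SAR, where observations are linked through the network, a combination of this kind would in general only approximate the full-data estimator, which is consistent with the abstract's mention of a second, bias-reducing step.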