Statistical analysis of large datasets is challenging because of limited device memory and excessive computation time. The divide-and-conquer (DC) algorithm is an effective remedy, but it still has limitations for statistical inference. Empirical likelihood is an important semiparametric and nonparametric method for parameter estimation and statistical inference, and estimating equations build a bridge between empirical likelihood and traditional statistical methods, which has made empirical likelihood widely used in various traditional statistical models. In this paper, we propose a novel approach to the challenges that massive data pose for empirical likelihood, called split sample mean empirical likelihood (SSMEL). Our approach provides a distinct perspective for solving big data problems. We show that the SSMEL estimator has the same estimation efficiency as the empirical likelihood estimator computed on the full dataset, and that it preserves the important statistical property of Wilks' theorem, so the proposed approach can be used for statistical inference. The effectiveness of the proposed approach is illustrated with simulation studies and real data analysis.