Subsampling algorithms for various parametric regression models with massive data have been extensively investigated in recent years. However, all existing studies on subsampling rely heavily on the massive data being clean. In practical applications, the observed covariates may be inaccurate because of measurement errors. To address the challenge of large datasets with measurement errors, this study explores two subsampling algorithms based on the corrected likelihood approach: an optimal subsampling algorithm that uses inverse probability weighting, and a perturbation subsampling algorithm that employs random weighting under the assumption of a perfectly known distribution. Theoretical properties of both algorithms are provided. Numerical simulations and two real-world examples demonstrate the effectiveness of the proposed methods compared with uncorrected algorithms.
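To make the idea of combining a corrected likelihood with inverse probability weighting concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a linear model, additive normal measurement error W = X + U with a known covariance `sigma_u`, and sampling probabilities proportional to 1 + ||W_i||, which are chosen purely for illustration and are not the optimal probabilities studied in the paper. The function name `ipw_subsample_fit` and all simulation settings are hypothetical.

```python
import numpy as np


def ipw_subsample_fit(W, y, sigma_u, r, probs, rng, correct=True):
    """Weighted least squares on a subsample of size r drawn with replacement.

    With correct=True, the term sum_i w_i * sigma_u is subtracted from the
    weighted Gram matrix, which is the corrected-score adjustment for a linear
    model with additive measurement error of known covariance (an assumption
    of this sketch).
    """
    idx = rng.choice(W.shape[0], size=r, replace=True, p=probs)
    Ws, ys = W[idx], y[idx]
    w = 1.0 / probs[idx]                      # inverse probability weights
    A = (Ws * w[:, None]).T @ Ws              # sum_i w_i W_i W_i^T
    if correct:
        A = A - w.sum() * sigma_u             # remove measurement-error bias
    b = (Ws * w[:, None]).T @ ys              # sum_i w_i W_i y_i
    return np.linalg.solve(A, b)


# Simulated full data (all settings below are illustrative assumptions).
rng = np.random.default_rng(0)
n, r = 200_000, 20_000
beta = np.array([1.0, -0.5, 2.0])
sigma_u = 0.5 * np.eye(3)                     # known error covariance

X = rng.normal(size=(n, 3))                   # true covariates (unobserved)
y = X @ beta + rng.normal(size=n)
W = X + rng.normal(scale=np.sqrt(0.5), size=(n, 3))  # observed covariates

# Illustrative non-uniform sampling probabilities (not the optimal ones).
scores = 1.0 + np.linalg.norm(W, axis=1)
probs = scores / scores.sum()

beta_corrected = ipw_subsample_fit(W, y, sigma_u, r, probs, rng, correct=True)
beta_naive = ipw_subsample_fit(W, y, sigma_u, r, probs, rng, correct=False)

print("true     :", beta)
print("corrected:", beta_corrected)
print("naive    :", beta_naive)
```

In this setup the uncorrected fit is attenuated toward zero because the error variance inflates the Gram matrix of the observed covariates, while the corrected fit should land close to the true coefficients. A perturbation-style variant would replace the sampling step with random weights, which this sketch does not attempt.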