Computational capacity often falls short when confronted with massive data, a common obstacle to building statistical models or inference methods for big data. While subsampling techniques have been extensively developed to reduce data volume, there is a notable gap in handling large-scale reliability data, in which a large proportion of observations is commonly censored. In this article, we propose an efficient subsampling method for reliability analysis in the presence of censored data, with the aim of estimating the parameters of the lifetime distribution. We also propose a novel subsampling method for severely censored data, i.e., data in which only a tiny proportion of observations is complete. The subsampling-based estimators are given, and their asymptotic properties are derived. The optimal subsampling probabilities are derived through the L-optimality criterion, which minimizes the trace of the product of the asymptotic covariance matrix and a constant matrix. Because the optimal subsampling strategy depends on unknown parameters that would have to be estimated from the full data, efficient algorithms are proposed to implement the proposed subsampling methods. A real-world hard drive dataset and simulation studies are used to demonstrate the superior performance of the proposed methods.
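To make the two-step idea in the abstract concrete, the following is a minimal sketch under strong simplifying assumptions, not the method derived in the article. It assumes an exponential lifetime with right censoring, uses a uniform pilot subsample to get a rough rate estimate, and then draws the main subsample with probabilities proportional to the absolute per-observation score at the pilot estimate, a form commonly associated with L-optimal subsampling in the wider literature. The function names, the `mix` parameter (a blend with uniform probabilities to keep the inverse-probability weights bounded), and the simulated data are all hypothetical choices made for illustration.

```python
import numpy as np

def weighted_exp_mle(y, delta, weights):
    """Weighted MLE of an exponential rate under right censoring.

    y: observed times min(T, C); delta: 1.0 if the failure was observed, else 0.0.
    """
    return np.sum(weights * delta) / np.sum(weights * y)

def two_step_subsample_rate(y, delta, pilot_size, sub_size, rng, mix=0.1):
    """Illustrative two-step subsampling estimator (assumed form, see lead-in)."""
    n = len(y)
    # Step 1: a uniform pilot subsample gives a rough estimate of the rate.
    pilot = rng.choice(n, size=pilot_size, replace=True)
    if delta[pilot].sum() == 0:
        raise ValueError("pilot subsample contains no observed failures")
    lam0 = weighted_exp_mle(y[pilot], delta[pilot], np.ones(pilot_size))
    # Step 2: probabilities proportional to the absolute score at the pilot
    # estimate, blended with uniform probabilities so no weight explodes.
    score = np.abs(delta / lam0 - y)
    probs = (1.0 - mix) * score / score.sum() + mix / n
    idx = rng.choice(n, size=sub_size, replace=True, p=probs)
    # Inverse-probability weights correct for the non-uniform sampling.
    estimate = weighted_exp_mle(y[idx], delta[idx], 1.0 / probs[idx])
    return estimate, probs

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    lifetimes = rng.exponential(scale=0.5, size=200_000)  # true rate 2.0
    censor_time = 0.3                                     # fixed censoring time
    y = np.minimum(lifetimes, censor_time)
    delta = (lifetimes <= censor_time).astype(float)
    full = weighted_exp_mle(y, delta, np.ones(len(y)))
    sub, _ = two_step_subsample_rate(y, delta, 1_000, 5_000, rng)
    print(f"full-data estimate: {full:.3f}, subsample estimate: {sub:.3f}")
```

The pilot step reflects the difficulty noted in the abstract: the sampling probabilities depend on a parameter value that is unknown before the data are analyzed, so a cheap preliminary estimate stands in for it.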