Big data streams are attracting increasing attention as modern science and information technology develop. Because limited computer memory cannot hold high-volume streaming data, real-time methods that do not store historical data are worth investigating. Moreover, outliers may arise as high-velocity data streams are generated, which calls for more robust analysis. Motivated by these concerns, this paper proposes a novel Online Updating Huber Robust Regression algorithm. By extracting key features of each new data subset, it obtains a computationally efficient online updating estimator without storing historical data. Meanwhile, by integrating Huber regression into the framework, the estimator is robust to contaminated data streams, such as heavy-tailed or heterogeneously distributed ones, as well as cases with outliers. Furthermore, the proposed online updating estimator is asymptotically equivalent to the oracle estimator computed from the entire data set and has lower computational complexity. Extensive numerical simulations and a real data analysis are also conducted to evaluate the estimation and computational efficiency of the proposed method.
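To make the idea of updating a Huber estimator from batch summaries more concrete, the following is a minimal sketch and not the paper's algorithm, whose details are not given in the abstract. It assumes a generic cumulative scheme: each incoming batch is fitted with Huber regression by iteratively reweighted least squares, only a p-by-p matrix and a p-vector are kept as running summaries, and the batch itself is then discarded. The names `huber_irls` and `OnlineHuber`, the fixed threshold `delta=1.345`, and the assumption that the noise scale is roughly 1 are all illustrative choices.

```python
import numpy as np


def huber_irls(X, y, delta=1.345, n_iter=50, tol=1e-8):
    """Fit Huber regression on one batch by iteratively reweighted least squares.

    Assumes the noise scale is about 1, so `delta` is used without rescaling.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares starting value
    for _ in range(n_iter):
        r = y - X @ beta
        # Huber weights: 1 for small residuals, delta/|r| for large ones.
        w = np.minimum(1.0, delta / np.maximum(np.abs(r), 1e-12))
        XtW = X.T * w  # scales column i of X.T (observation i) by w[i]
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta


class OnlineHuber:
    """Hypothetical cumulative estimator that stores only two running summaries."""

    def __init__(self, p, delta=1.345):
        self.delta = delta
        self.A = np.zeros((p, p))  # running sum of batch Hessians
        self.b = np.zeros(p)       # running sum of Hessian @ batch estimate

    def update(self, X, y):
        """Absorb one batch and return the updated estimate."""
        beta_k = huber_irls(X, y, self.delta)
        r = y - X @ beta_k
        # Hessian of the Huber loss: only residuals within delta contribute.
        inlier = (np.abs(r) <= self.delta).astype(float)
        A_k = (X.T * inlier) @ X
        self.A += A_k
        self.b += A_k @ beta_k
        return self.estimate()

    def estimate(self):
        """Hessian-weighted combination of all batch estimates seen so far."""
        return np.linalg.solve(self.A, self.b)
```

Calling `update(X, y)` once per arriving batch returns the current estimate, and memory use stays at O(p^2) regardless of how many batches have been seen. Each batch needs noticeably more rows than columns so that the accumulated matrix is invertible, and `estimate()` is only meaningful after at least one update.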