Internet of Things (IoT) systems generate massive, high-speed, temporally correlated streaming data and are often tied to online inference tasks that run under computational or energy constraints. Online analysis of such streaming time series often faces a trade-off between statistical efficiency and computational cost. One important way to balance this trade-off is sampling, in which only a small portion of the data is selected for model fitting and updating. Motivated by the demands of dynamic relationship analysis in IoT systems, we study data-dependent sample selection and online inference for multi-dimensional streaming time series, with the aim of providing low-cost real-time analysis of high-speed power grid electricity consumption data. Inspired by the D-optimality criterion in the design of experiments, we propose a class of online data reduction methods that achieve an optimal sampling criterion and improve the computational efficiency of online analysis. We show that the optimal solution amounts to a strategy that mixes Bernoulli sampling and leverage score sampling. The leverage score sampling involves auxiliary estimations that have a computational advantage over recursive least squares updates. Theoretical properties of these auxiliary estimations are also discussed. When applied to European power grid consumption data, the proposed leverage-score-based sampling methods outperform the benchmark sampling method in online estimation and prediction. The general applicability of the sampling-assisted online estimation method is assessed via simulation studies.
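To make the idea of mixing Bernoulli sampling with leverage score sampling concrete, the following is a minimal illustrative sketch and not the authors' algorithm. It assumes a linear model fitted by recursive least squares on the retained observations only, and it computes a leverage-type score from the inverse Gram matrix that the recursion already maintains, rather than from the cheaper auxiliary estimations mentioned in the abstract. The names `SampledRLS`, `keep_probability`, `rate`, `alpha`, and `demo` are hypothetical, and the AR(1) covariate stream is a synthetic stand-in for real data.

```python
import random


def mat_vec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


class SampledRLS:
    """Recursive least squares that updates only on sampled observations."""

    def __init__(self, dim, rate=0.2, alpha=0.5, seed=0):
        self.dim = dim
        self.rate = rate    # assumed target fraction of observations to keep
        self.alpha = alpha  # assumed weight on the Bernoulli (uniform) part
        self.beta = [0.0] * dim
        # P approximates the inverse Gram matrix of the kept covariates.
        self.P = [[1e3 if i == j else 0.0 for j in range(dim)]
                  for i in range(dim)]
        self.kept = 0
        self.rng = random.Random(seed)

    def keep_probability(self, x):
        # Leverage-type score x' P x, rescaled so a typical value is near 1.
        Px = mat_vec(self.P, x)
        score = max(self.kept, 1) * dot(x, Px) / self.dim
        # Mixture of a constant (Bernoulli) term and a leverage-driven term.
        p = self.rate * (self.alpha + (1.0 - self.alpha) * score)
        return min(1.0, p)

    def observe(self, x, y):
        """Decide whether to keep (x, y); update the estimate if kept."""
        if self.rng.random() >= self.keep_probability(x):
            return False
        Px = mat_vec(self.P, x)
        denom = 1.0 + dot(x, Px)
        gain = [v / denom for v in Px]
        err = y - dot(x, self.beta)
        self.beta = [b + g * err for b, g in zip(self.beta, gain)]
        # Sherman-Morrison rank-one update of P (P is symmetric).
        self.P = [[self.P[i][j] - gain[i] * Px[j] for j in range(self.dim)]
                  for i in range(self.dim)]
        self.kept += 1
        return True


def demo(n=5000, seed=1):
    """Run the sketch on a synthetic, temporally correlated stream."""
    rng = random.Random(seed)
    true_beta = [2.0, -1.0]
    model = SampledRLS(dim=2, rate=0.2, alpha=0.5, seed=seed)
    x = [0.0, 0.0]
    for _ in range(n):
        # AR(1) covariates as a stand-in for a temporally correlated stream.
        x = [0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]
        y = dot(x, true_beta) + rng.gauss(0.0, 0.1)
        model.observe(x, y)
    return model.beta, model.kept / n
```

Calling `demo()` returns the coefficient estimate together with the fraction of observations that were retained. Because the retention decision depends only on the covariates and not on the response, the unweighted least squares update on the retained points remains a reasonable estimator under the assumed linear model. Setting `alpha=1.0` reduces the rule to plain Bernoulli sampling at rate `rate`.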