Data stream algorithms tackle operations on high-volume sequences of read-once data items. Data stream scenarios include inherently real-time systems like sensor networks and financial markets. They also arise in purely computational scenarios like ordered traversal of big data or long-running iterative simulations. In this work, we develop methods to maintain running archives of stream data that are temporally representative, a task we call "stream curation." Our approach contributes to the rich existing literature on data stream binning, which we extend by providing stateless (i.e., non-iterative) curation schemes that enable key optimizations to trim archive storage overhead and streamline processing of incoming observations. We also broaden support to cover new trade-offs between curated archive size and temporal coverage. We present a suite of five stream curation algorithms that span $\mathcal{O}(n)$, $\mathcal{O}(\log n)$, and $\mathcal{O}(1)$ orders of growth for retained data items. Within each order of growth, we provide algorithms that either maintain even coverage across history or bias coverage toward more recent time points. More broadly, memory-efficient stream curation can boost the data stream mining capabilities of low-grade hardware in roles such as sensor nodes and data logging devices.
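To make the notion of a stateless curation scheme concrete, the following is a minimal illustrative sketch. It is not one of the five algorithms presented in this work. It assumes a simple even-coverage policy with bounded archive size, in which the set of retained items is computed directly from the number of items seen so far, without replaying earlier curation steps. The names `stride`, `retained_ranks`, and the size parameter `k` are hypothetical and chosen only for illustration.

```python
def stride(n: int, k: int) -> int:
    """Spacing between retained ranks after n items have arrived.

    Hypothetical policy: the largest power of two not exceeding
    max(1, n // k), so the spacing doubles as the stream grows.
    """
    q = max(1, n // k)
    return 1 << (q.bit_length() - 1)


def retained_ranks(n: int, k: int) -> list[int]:
    """Ranks (0-indexed arrival positions) kept after n items.

    Computed from n and k alone: no per-step state is consulted.
    """
    return list(range(0, n, stride(n, k)))


if __name__ == "__main__":
    k = 4
    for n in (3, 8, 16, 33):
        print(n, retained_ranks(n, k))
```

Because the stride only ever grows by doubling, every rank retained at a later time was also retained at every earlier time after its arrival, so the archive never needs an item that has already been discarded. Under this particular policy the archive holds at most $2k$ items and retained ranks are evenly spaced across history. A recency-biased variant would require a different retention rule.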