In data stream processing, SET-INCREMENT Mixed (SIM) data streams require algorithms that handle both SET and INCREMENT operations efficiently. We present Carbonyl4, an algorithm designed specifically for SIM data streams that is accurate, unbiased, and adaptable. Carbonyl4 introduces two techniques: the Balance Bucket, which refines variance optimization, and the Cascading Overflow, which maintains precision when overflow occurs. Experiments on four diverse datasets show that Carbonyl4 outperforms existing algorithms, particularly in the accuracy of item-level information retrieval and in adaptability to fluctuating memory requirements. Carbonyl4 can also shrink its memory dynamically, via a re-sampling approach and a heuristic approach. The source code of Carbonyl4 is available on GitHub.
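To make the SIM data model concrete, the following is a minimal sketch of an exact, unbounded-memory replay of a SIM stream. It is not Carbonyl4 and does not use the Balance Bucket or Cascading Overflow; it only illustrates the ground truth that an approximate algorithm would be measured against. The `(op, key, value)` tuple encoding, the function name `replay_exact`, and the semantics assumed here (SET overwrites a key's value, INCREMENT adds to it, and unseen keys start at 0) are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical encoding of one stream update: (operation, key, value).
Op = Tuple[str, str, int]


def replay_exact(stream: Iterable[Op]) -> Dict[str, int]:
    """Return the exact per-key values after replaying a SIM stream.

    Assumed semantics (not taken from the source text):
      SET       overwrites the key's current value,
      INCREMENT adds the value to the key's current value (0 if unseen).
    Memory grows with the number of distinct keys, which is exactly
    what a bounded-memory algorithm is meant to avoid.
    """
    values: Dict[str, int] = defaultdict(int)
    for op, key, value in stream:
        if op == "SET":
            values[key] = value
        elif op == "INCREMENT":
            values[key] += value
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return dict(values)


# A small mixed stream: SET and INCREMENT interleave on the same keys.
stream = [
    ("INCREMENT", "a", 3),
    ("SET", "b", 10),
    ("INCREMENT", "b", -4),
    ("SET", "a", 1),
    ("INCREMENT", "a", 2),
]
result = replay_exact(stream)
```

Under these assumed semantics, a SET discards whatever a key accumulated before it, so the order of updates matters. This is one reason a structure built only for INCREMENT-style counting would not carry over directly to mixed streams.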