Streaming data systems increasingly underpin machine learning workflows that maintain large numbers of continuously updated aggregations. In production settings, each incoming event typically triggers a read-modify-write operation against persistent storage, making high-frequency state updates a dominant source of latency, contention, and operational cost. In this work, we decouple inference from state persistence in streaming machine learning pipelines via probabilistic thinning: every event is scored, but durable state updates are triggered only by informative events. Unlike approaches that shed input or state, we show that persistence-path control is achievable without a high-frequency in-memory control plane or cross-worker coordination, relying exclusively on approximate statistics retrieved from disk-backed key-value stores. We model the resulting stochastic processes, derive bounds on filtering rates, and prove that common time-based aggregations remain unbiased under variance-aware formulations, preventing systematic error accumulation. We evaluate the approach in a controlled setting that isolates per-event costs, demonstrating substantial reductions in storage I/O and serialization overhead. Across experiments, up to 90% of events are excluded from the persistence path while preserving, and in some cases improving, downstream utility.
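To make the idea of thinning the persistence path concrete, the following is a minimal sketch and not the method evaluated in this work. It assumes a single uniform keep probability (`keep_prob`, a hypothetical parameter) rather than an informativeness-dependent one, and it uses standard inverse-probability weighting so that a running sum stays unbiased in expectation even though most events never reach the durable aggregate. The function name `thinned_sum` and the in-memory stand-in for the key-value store are illustrative assumptions.

```python
import random


def thinned_sum(values, keep_prob, rng):
    """Estimate sum(values) while "persisting" only a random subset.

    Assumption: a uniform keep probability; each kept value is scaled by
    1 / keep_prob so that the expected persisted total equals the true sum.
    """
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")

    persisted_total = 0.0  # stands in for the durable aggregate
    writes = 0             # number of simulated read-modify-write operations

    for v in values:
        # Every event is observed here (and could be scored);
        # only a random subset triggers a state update.
        if rng.random() < keep_prob:
            persisted_total += v / keep_prob
            writes += 1

    return persisted_total, writes


if __name__ == "__main__":
    rng = random.Random(0)
    events = [1.0] * 10_000
    estimate, writes = thinned_sum(events, keep_prob=0.1, rng=rng)
    print(f"true sum={sum(events):.0f} estimate={estimate:.0f} writes={writes}")
```

With `keep_prob=0.1`, roughly 90% of events skip the simulated write while the weighted total remains an unbiased estimate of the true sum. The price is added variance, which is what a variance-aware formulation has to control.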