FluxSieve: Unifying Streaming and Analytical Data Planes for Scalable Cloud Observability

Despite many advances in query optimization, indexing techniques, and data storage, modern data platforms still struggle to deliver robust query performance under high concurrency and computationally intensive queries. This challenge is particularly pronounced in large-scale observability platforms that handle high-volume, high-velocity data records. For instance, expensive filtering queries that recur at query time impose substantial computational and storage overhead on the analytical data plane. In this paper, we propose FluxSieve, a unified architecture that reconciles traditional pull-based query processing with push-based stream processing by embedding a lightweight in-stream precomputation and filtering layer directly into the data ingestion path. This avoids the complexity and operational burden of running queries in dedicated stream processing frameworks. Concretely, this work (i) introduces a foundational architecture that unifies the streaming and analytical data planes via in-stream filtering and record enrichment, (ii) designs a scalable multi-pattern matching mechanism that supports concurrent evaluation and on-the-fly updates of filtering rules with minimal per-record overhead, (iii) demonstrates how to integrate this ingestion-time processing with two open-source analytical systems: Apache Pinot, a Real-Time Online Analytical Processing (RTOLAP) engine, and DuckDB, an embedded analytical database, and (iv) presents a comprehensive experimental evaluation of our approach. Our evaluation across different systems, query types, and performance metrics shows query-performance improvements of up to orders of magnitude, at the cost of negligible additional storage and very low computational overhead.
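To make the idea of ingestion-time filtering and record enrichment more concrete, the following is a minimal sketch and not the FluxSieve implementation. It assumes that filtering rules are literal substrings and that they are evaluated with an Aho-Corasick automaton; the abstract does not name the matching algorithm, so that choice is an assumption. The names RuleMatcher, IngestFilter, update_rules, the "message" field, and the "matched_rules" column are all hypothetical.

```python
from collections import deque


class RuleMatcher:
    """Hypothetical multi-pattern matcher: an Aho-Corasick automaton over
    literal substrings (an assumed technique, not confirmed by the paper)."""

    def __init__(self, rules):
        # rules: dict mapping a hypothetical rule id to a non-empty substring
        self.goto = [{}]    # trie transitions, one dict per state
        self.fail = [0]     # failure link per state
        self.out = [set()]  # rule ids recognized on reaching each state
        for rule_id, pattern in rules.items():
            state = 0
            for ch in pattern:
                nxt = self.goto[state].get(ch)
                if nxt is None:
                    nxt = len(self.goto)
                    self.goto[state][ch] = nxt
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append(set())
                state = nxt
            self.out[state].add(rule_id)
        # Breadth-first pass: compute failure links and merge inherited outputs.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt] |= self.out[self.fail[nxt]]

    def match(self, text):
        """Return the ids of all rules whose pattern occurs in text,
        scanning the text once regardless of the number of rules."""
        state, hits = 0, set()
        for ch in text:
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            hits |= self.out[state]
        return hits


class IngestFilter:
    """Hypothetical ingestion-path stage that tags each record with the
    ids of the rules it matches before the record reaches storage."""

    def __init__(self, rules):
        self._matcher = RuleMatcher(rules)

    def update_rules(self, rules):
        # Build the new automaton first, then swap the reference in a single
        # assignment so records in flight keep using a complete matcher.
        self._matcher = RuleMatcher(rules)

    def process(self, record, field="message"):
        matcher = self._matcher  # snapshot the current rule set
        enriched = dict(record)
        text = str(record.get(field, ""))
        enriched["matched_rules"] = sorted(matcher.match(text))
        return enriched
```

Under these assumptions, a recurring filter such as "all records containing 'timeout'" could be answered at query time by a lookup on the precomputed matched_rules column rather than by scanning raw text. How the enriched field would be stored and indexed in Apache Pinot or DuckDB is not shown here.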
