Auditability and reproducibility still are critical challenges for real-time data streams pipelines. Streaming engines are highly dependent on runtime scheduling, window triggers, arrival orders, and uncertainties such as network jitters. These all derive the streaming pipeline platforms to throw non-determinist outputs. In this work, we introduce a blockchain-backed provenance architecture for streaming platform (e.g Kafka Streams) the publishes cryptographic data of a windowed data stream without publishing window payloads on-chain. We used real-time weather data from weather stations in Berlin. Weather records are canonicalized, deduplicated, and aggregated per window, then serialised deterministically. Furthermore, the Merkle root of the records within the window is computed and stored alongside with Kafka offsets boundaries to MultiChain blockchain streams as checkpoints. Our design can enable an independent auditor to verify: (1) the completeness of window payloads, (2) canonical serialization, and (3) correctness of derived analytics such as minimum/maximum/average temperatures. We evaluated our system using real data stream from two weather stations (Berlin-Brandenburg and Berlin-Tempelhof) and showed linear verification cost, deterministic reproducibility, and with a scalable off-chain storage with on-chain cryptographic anchoring. We also demonstrated that the blockchain can afford to be integrated with streaming platforms particularly with our system, and we get satisfactory transactions per second values.
翻译:可审计性与可复现性仍然是实时数据流处理管道面临的关键挑战。流处理引擎高度依赖于运行时调度、窗口触发器、数据到达顺序以及网络抖动等不确定性因素。这些都导致流处理平台产生非确定性的输出。在本工作中,我们提出了一种基于区块链的流处理平台(如Kafka Streams)溯源架构,该架构在链上发布窗口化数据流的加密数据,而无需发布窗口负载内容。我们使用了来自柏林气象站的实时天气数据。天气记录经过规范化、去重和按窗口聚合后,被确定性地序列化。此外,计算窗口内记录的默克尔根,并与Kafka偏移量边界一同作为检查点存储到MultiChain区块链流中。我们的设计使得独立审计者能够验证:(1) 窗口负载的完整性,(2) 规范化序列化过程,以及(3) 衍生分析(如最小/最大/平均温度)的正确性。我们使用来自两个气象站(柏林-勃兰登堡和柏林-滕珀尔霍夫)的真实数据流评估了系统,结果表明其具有线性验证成本、确定性可复现性,以及可扩展的链下存储与链上加密锚定机制。我们还证明了区块链能够与流处理平台(特别是我们的系统)有效集成,并获得了令人满意的每秒交易处理量。