MultiChain Blockchain Data Provenance for Deterministic Stream Processing with Kafka Streams: A Weather Data Case Study

Auditability and reproducibility still are critical challenges for real-time data streams pipelines. Streaming engines are highly dependent on runtime scheduling, window triggers, arrival orders, and uncertainties such as network jitters. These all derive the streaming pipeline platforms to throw non-determinist outputs. In this work, we introduce a blockchain-backed provenance architecture for streaming platform (e.g Kafka Streams) the publishes cryptographic data of a windowed data stream without publishing window payloads on-chain. We used real-time weather data from weather stations in Berlin. Weather records are canonicalized, deduplicated, and aggregated per window, then serialised deterministically. Furthermore, the Merkle root of the records within the window is computed and stored alongside with Kafka offsets boundaries to MultiChain blockchain streams as checkpoints. Our design can enable an independent auditor to verify: (1) the completeness of window payloads, (2) canonical serialization, and (3) correctness of derived analytics such as minimum/maximum/average temperatures. We evaluated our system using real data stream from two weather stations (Berlin-Brandenburg and Berlin-Tempelhof) and showed linear verification cost, deterministic reproducibility, and with a scalable off-chain storage with on-chain cryptographic anchoring. We also demonstrated that the blockchain can afford to be integrated with streaming platforms particularly with our system, and we get satisfactory transactions per second values.

翻译：可审计性与可复现性仍然是实时数据流处理管道面临的关键挑战。流处理引擎高度依赖于运行时调度、窗口触发器、数据到达顺序以及网络抖动等不确定性因素。这些都导致流处理平台产生非确定性的输出。在本工作中，我们提出了一种基于区块链的流处理平台（如Kafka Streams）溯源架构，该架构在链上发布窗口化数据流的加密数据，而无需发布窗口负载内容。我们使用了来自柏林气象站的实时天气数据。天气记录经过规范化、去重和按窗口聚合后，被确定性地序列化。此外，计算窗口内记录的默克尔根，并与Kafka偏移量边界一同作为检查点存储到MultiChain区块链流中。我们的设计使得独立审计者能够验证：(1) 窗口负载的完整性，(2) 规范化序列化过程，以及(3) 衍生分析（如最小/最大/平均温度）的正确性。我们使用来自两个气象站（柏林-勃兰登堡和柏林-滕珀尔霍夫）的真实数据流评估了系统，结果表明其具有线性验证成本、确定性可复现性，以及可扩展的链下存储与链上加密锚定机制。我们还证明了区块链能够与流处理平台（特别是我们的系统）有效集成，并获得了令人满意的每秒交易处理量。

相关内容

Kafka

关注 162

Kafka是一种高吞吐量的分布式发布订阅消息系统，它可以处理消费者规模的网站中的所有动作流数据。这种动作（网页浏览，搜索和其他用户的行动）是在现代网络上的许多社会功能的一个关键因素。这些数据通常是由于吞吐量的要求而通过处理日志和日志聚合来解决。对于像Hadoop的一样的日志数据和离线分析系统，但又要求实时处理的限制，这是一个可行的解决方案。Kafka的目的是通过Hadoop的并行加载机制来统一线上和离线的消息处理，也是为了通过集群来提供实时的消费。

《基于战术区块链提供数据来源以支持战场物联网和大数据分析》美海军2022最新94页论文

专知会员服务

76+阅读 · 2022年12月13日

【军用区块链+复杂系统】《数据信任方法学：基于区块链的军事复杂系统检测》麻省理工林肯实验室

专知会员服务

56+阅读 · 2022年6月11日