Historically, machine learning training pipelines have relied predominantly on batch training, retraining models every few hours. However, industrial practitioners have shown that real-time training can lead to a more adaptive and personalized user experience. The transition from batch to real-time involves tradeoffs: capturing the benefits of accuracy and freshness while keeping costs low and preserving a predictable, maintainable system. Our work characterizes the migration of a machine learning model to a streaming pipeline using Apache Kafka and Flink. We demonstrate how to transition from Google Pub/Sub to Kafka for ingesting real-time events, and how to leverage Flink for streaming joins backed by RocksDB state and checkpointing. We also address challenges such as managing causal dependencies between events, balancing event time against processing time, and choosing between exactly-once and at-least-once delivery guarantees, among other issues. Furthermore, we show how we improved scalability through Kafka topic partitioning, reduced event throughput by \textbf{85\%} using Avro schemas and compression, cut costs by \textbf{40\%}, and implemented a separate pipeline to verify data correctness. Our findings provide valuable insights into the tradeoffs and complexities of real-time systems, enabling better-informed decisions tailored to the specific requirements of building effective streaming systems that enhance user satisfaction.
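The throughput reduction from schema-based encoding rests on a simple mechanism: a shared schema means field names never travel with the payload, and the resulting compact binary events also compress well. The sketch below illustrates that idea in plain Python with `struct` and `zlib`; it is not the paper's actual Avro pipeline, and the event fields (`user_id`, `item_id`, `ts_ms`, `action`) are hypothetical.

```python
import json
import struct
import zlib

def make_event(i):
    """Build a hypothetical click event; field names and values are illustrative."""
    return {
        "user_id": i,
        "item_id": 987654321 + i,
        "ts_ms": 1700000000000 + i,
        "action": i % 4,
    }

def encode_json(event):
    # JSON repeats every field name inside every event on the wire.
    return json.dumps(event).encode("utf-8")

def encode_binary(event):
    # Schema-based binary encoding (the core idea behind Avro): producer and
    # consumer share the schema, so only raw values travel.
    # Three u64 fields plus one u8 field = 25 bytes per event.
    return struct.pack(
        "<QQQB",
        event["user_id"],
        event["item_id"],
        event["ts_ms"],
        event["action"],
    )

# Encode a batch of 1000 events both ways and compare wire sizes.
batch_json = b"".join(encode_json(make_event(i)) for i in range(1000))
batch_bin = b"".join(encode_binary(make_event(i)) for i in range(1000))

print("JSON batch:", len(batch_json), "bytes ->",
      len(zlib.compress(batch_json)), "compressed")
print("Binary batch:", len(batch_bin), "bytes ->",
      len(zlib.compress(batch_bin)), "compressed")
```

The exact savings depend on field count and value entropy; real deployments would use an Avro library with a schema registry rather than hand-rolled `struct` packing, but the size asymmetry it demonstrates is the same lever behind the reported reduction.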