Historically, machine learning training pipelines have relied predominantly on batch training, retraining models every few hours. However, industrial practitioners have shown that real-time training can lead to a more adaptive and personalized user experience. The transition from batch to real-time involves tradeoffs: capturing the benefits of accuracy and freshness while keeping costs low and preserving a predictable, maintainable system. Our work characterizes the migration of a machine learning model to a streaming pipeline using Apache Kafka and Flink. We demonstrate how to transition from Google Pub/Sub to Kafka for ingesting real-time events, and how to leverage Flink for streaming joins backed by RocksDB state and checkpointing. We also address challenges such as managing causal dependencies between events, balancing event time against processing time, and choosing between exactly-once and at-least-once delivery guarantees, among other issues. Furthermore, we show how we improved scalability through Kafka topic partitioning, reduced event throughput by \textbf{85\%} using Avro schemas and compression, cut costs by \textbf{40\%}, and implemented a separate pipeline to verify data correctness. Our findings provide valuable insights into the tradeoffs and complexities of real-time systems, enabling better-informed decisions tailored to the specific requirements of building effective streaming systems that enhance user satisfaction.
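The throughput reduction from schema-based encoding rests on a simple mechanism: a shared schema means field names never travel with the payload, and the resulting compact binary events also compress well. The sketch below illustrates that idea in plain Python with `struct` and `zlib`; it is not the paper's actual Avro pipeline, and the event fields (`user_id`, `item_id`, `ts_ms`, `action`) are hypothetical.

```python
import json
import struct
import zlib

def make_event(i):
    """Build a hypothetical click event; field names and values are illustrative."""
    return {
        "user_id": i,
        "item_id": 987654321 + i,
        "ts_ms": 1700000000000 + i,
        "action": i % 4,
    }

def encode_json(event):
    # JSON repeats every field name inside every event on the wire.
    return json.dumps(event).encode("utf-8")

def encode_binary(event):
    # Schema-based binary encoding (the core idea behind Avro): producer and
    # consumer share the schema, so only raw values travel.
    # Three u64 fields plus one u8 field = 25 bytes per event.
    return struct.pack(
        "<QQQB",
        event["user_id"],
        event["item_id"],
        event["ts_ms"],
        event["action"],
    )

# Encode a batch of 1000 events both ways and compare wire sizes.
batch_json = b"".join(encode_json(make_event(i)) for i in range(1000))
batch_bin = b"".join(encode_binary(make_event(i)) for i in range(1000))

print("JSON batch:", len(batch_json), "bytes ->",
      len(zlib.compress(batch_json)), "compressed")
print("Binary batch:", len(batch_bin), "bytes ->",
      len(zlib.compress(batch_bin)), "compressed")
```

The exact savings depend on field count and value entropy; real deployments would use an Avro library with a schema registry rather than hand-rolled `struct` packing, but the size asymmetry it demonstrates is the same lever behind the reported reduction.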