Distributed Stream Processing Systems (DSPSs) form the backbone of real-time processing and analytics at ByteDance, where Apache Flink powers one of the largest production clusters worldwide. Meeting strict Service Level Objectives (SLOs) requires both resiliency, the ability to withstand and rapidly recover from failures, and operational stability, meaning consistent and predictable performance under normal conditions. However, achieving resiliency and stability in large-scale production environments remains challenging because of the cluster scale, the diversity of business workloads, and the significant operational overhead. In this work, we present StreamShield, a production-proven resiliency solution deployed in ByteDance's Flink clusters. Designed from the complementary perspectives of the engine and the cluster, StreamShield introduces key techniques to enhance resiliency, covering runtime optimization, fine-grained fault tolerance, a hybrid replication strategy, and high availability with respect to external systems. Furthermore, StreamShield proposes a robust testing and deployment pipeline that ensures reliability and robustness in production releases. Extensive evaluations on a production cluster demonstrate the efficiency and effectiveness of the techniques proposed by StreamShield.