StreamShield: A Production-Proven Resiliency Solution for Apache Flink at ByteDance

Yong Fang,Yuxing Han,Meng Wang,Yifan Zhang,Yue Ma,Chi Zhang

Distributed Stream Processing Systems (DSPSs) form the backbone of real-time processing and analytics at ByteDance, where Apache Flink powers one of the largest production clusters worldwide. Resiliency, the ability to withstand and rapidly recover from failures, and operational stability, the ability to deliver consistent and predictable performance under normal conditions, are both essential for meeting strict Service Level Objectives (SLOs). However, achieving resiliency and stability in large-scale production environments remains challenging because of cluster scale, business diversity, and significant operational overhead. In this work, we present StreamShield, a production-proven resiliency solution deployed in ByteDance's Flink clusters. Designed from the complementary perspectives of the engine and the cluster, StreamShield introduces key techniques to enhance resiliency: runtime optimization, fine-grained fault tolerance, a hybrid replication strategy, and high availability under external systems. Furthermore, StreamShield proposes a robust testing and deployment pipeline that ensures reliability and robustness in production releases. Extensive evaluations on a production cluster demonstrate the efficiency and effectiveness of the proposed techniques.

