A Spatially Correlated Competing Risks Time-to-Event Model for Supercomputer GPU Failure Data



March 29, 2023



Jie Min, Yili Hong, William Q. Meeker, George Ostrouchov

From arXiv; 45 pages, 25 figures

Graphics processing units (GPUs) are widely used in high-performance computing (HPC) applications such as image and video processing and the training of deep-learning models in artificial intelligence. GPUs installed in HPC systems are often heavily used, and GPU failures occur during system operation, so GPU reliability matters for the overall reliability of HPC systems. The Cray XK7 Titan supercomputer was one of the top ten supercomputers in the world. The failure event times of more than 30,000 GPUs in Titan were recorded, and previous data analysis suggested that the failure time of a GPU may be affected by, among other factors, the GPU's connectivity location inside the supercomputer. In this paper, we conduct in-depth statistical modeling of GPU failure times to study the effect of location on GPU failures under competing risks with covariates and spatially correlated random effects. In particular, two major failure types of GPUs in Titan are considered. The connectivity locations of cabinets are modeled as spatially correlated random effects, and the positions of GPUs inside each cabinet are treated as covariates. A Bayesian framework is used for statistical inference. We also compare different estimation methods, such as maximum likelihood implemented via an expectation-maximization algorithm. Our results provide insights into GPU failures in HPC systems.
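To make the data structure described in the abstract concrete, the following is a minimal simulation sketch, not the authors' model. It assumes two latent Weibull failure times per GPU (one per failure type), a log-scale effect of a hypothetical within-cabinet position covariate, a single cabinet effect shared by both failure types, and a fixed censoring time. Spatial correlation is induced by averaging independent normal draws over neighbouring cabinets on a grid, which is a stand-in chosen for simplicity. All parameter values, the grid size, and the function names are illustrative assumptions; times are in arbitrary units.

```python
import math
import random


def smoothed_cabinet_effects(n_rows, n_cols, sigma, rng):
    """Return a grid of spatially correlated effects (assumed construction).

    Each cabinet's effect is the mean of independent N(0, sigma^2) draws over
    its 3x3 grid neighbourhood, so adjacent cabinets share noise terms.
    """
    raw = [[rng.gauss(0.0, sigma) for _ in range(n_cols)] for _ in range(n_rows)]
    effects = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            cells = [raw[i][j]
                     for i in range(max(0, r - 1), min(n_rows, r + 2))
                     for j in range(max(0, c - 1), min(n_cols, c + 2))]
            row.append(sum(cells) / len(cells))
        effects.append(row)
    return effects


def weibull_time(shape, scale, rng):
    """Draw a Weibull failure time by inverting the CDF."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return scale * (-math.log(u)) ** (1.0 / shape)


def simulate_gpu(position, cabinet_effect, censor_time, rng):
    """Return (observed_time, cause); cause 0 means right-censored."""
    # Hypothetical per-cause parameters: (shape, baseline scale, position coefficient).
    causes = {1: (1.5, 8.0, 0.3), 2: (0.9, 12.0, -0.2)}
    latent = {}
    for k, (shape, scale, beta) in causes.items():
        # A larger linear predictor shortens the typical lifetime for cause k.
        log_scale = math.log(scale) - beta * position - cabinet_effect
        latent[k] = weibull_time(shape, math.exp(log_scale), rng)
    cause = min(latent, key=latent.get)  # the earliest latent failure is observed
    if latent[cause] > censor_time:
        return censor_time, 0
    return latent[cause], cause


rng = random.Random(0)
effects = smoothed_cabinet_effects(4, 5, 0.5, rng)
records = [simulate_gpu(position=p, cabinet_effect=effects[r][c],
                        censor_time=6.0, rng=rng)
           for r in range(4) for c in range(5) for p in range(3)]
```

Each record pairs an observed time with a cause indicator, which is the form of data a competing-risks likelihood is built on: a failure of type 1 or 2 contributes that cause's density together with the survival function of the other cause, and a censored GPU contributes only survival terms.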

