Large Language Models (LLMs) are revolutionizing the AI industry with their superior capabilities. Training these models requires large-scale GPU clusters and long computing times, during which failures occur frequently and substantially increase training costs. Despite the significance of this problem, the field lacks a metric for evaluating reliability. In this work, we introduce a novel reliability metric called \emph{Training Overhead Ratio} (TOR) to evaluate the reliability of fault-tolerant LLM training systems. TOR is defined as the ratio of the optimal training time to the observed training time of a system, serving as a practical tool for users to estimate the actual time required to train an LLM on a given system. Furthermore, our investigation identifies the key factor for enhancing reliability and presents TOR equations for the various types of failures encountered in practice.
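The definition above can be stated compactly as follows (the symbols $T_{\mathrm{opt}}$ and $T_{\mathrm{obs}}$ are illustrative names not taken from the source):

\[
\mathrm{TOR} \;=\; \frac{T_{\mathrm{opt}}}{T_{\mathrm{obs}}} \;\in\; (0, 1],
\]

where $T_{\mathrm{opt}}$ denotes the optimal (failure-free) training time and $T_{\mathrm{obs}}$ the observed training time on the system. A TOR of 1 indicates no reliability overhead, while smaller values indicate greater time lost to failures and recovery.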