Distributed training is essential for scaling the training of large neural network models, such as large language models (LLMs), across thousands of GPUs. However, the complexity of distributed training programs makes them particularly prone to silent bugs, which produce no explicit error signals yet lead to incorrect training outcomes. Effectively detecting and localizing such silent bugs in distributed training is challenging: common debugging practices based on monitoring training-loss or gradient-norm curves are indirect, inefficient, and provide no way to localize bugs. To address these challenges, we design and implement TTrace, the first systematic differential-testing system for detecting and localizing silent bugs in distributed training. TTrace aligns intermediate tensors from the distributed run with those from a trusted reference implementation. To soundly compare the floating-point values in corresponding tensors, we propose a novel mathematical analysis that yields a guideline for setting tolerances, enabling TTrace to distinguish bug-induced errors from benign numerical errors. Experimental results show that TTrace detects 11 existing bugs and 3 new bugs in the widely used Megatron-LM framework while requiring fewer than 10 lines of code changes. TTrace remains effective across diverse training recipes, including low-precision recipes involving BF16 and FP8. Notably, a popular open-source training framework has already adopted the method proposed by TTrace in its development workflow.
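To make the tolerance-setting idea concrete, the following minimal sketch shows a dtype-aware comparison of corresponding tensors in PyTorch. It is illustrative only: the function name tensors_match, the sqrt(k) error model, and the use of machine epsilon as the base tolerance are assumptions made for exposition, not the paper's actual analysis or tolerance formula.

import torch

def tensors_match(test: torch.Tensor, ref: torch.Tensor, k: int) -> bool:
    """Check whether `test` (from the distributed run) agrees with `ref`
    (from the trusted reference) within a dtype-aware tolerance.

    `k` is the length of the reduction dimension that produced the tensor
    (e.g., the inner dimension of a matmul); under a simple random-rounding
    model, accumulated error grows roughly like sqrt(k) times the machine
    epsilon of the compute dtype. This is a heuristic stand-in for the
    paper's analysis, not its formula.
    """
    eps = torch.finfo(test.dtype).eps           # machine epsilon of the compute dtype
    tol = (k ** 0.5) * eps                      # heuristic tolerance for rounding noise
    diff = (test.float() - ref.float()).abs()
    scale = ref.float().abs().clamp_min(1e-12)  # guard against division by zero
    return bool((diff / scale <= tol).all())

# Usage: noise below the bf16 tolerance passes, a corrupted element does not.
ref = torch.randn(1024, 1024).abs() + 1.0       # keep values away from zero
test = (ref * (1 + 1e-3 * torch.randn_like(ref))).bfloat16()
print(tensors_match(test, ref, k=1024))         # True: within numerical tolerance
test[0, 0] = 10.0                               # inject a silent single-element bug
print(tensors_match(test, ref, k=1024))         # False: flagged as bug-induced

The key design point the sketch illustrates is that the pass/fail threshold scales with both the compute dtype and the amount of accumulation, which is what lets a differential-testing system separate bug-induced errors from expected floating-point noise.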