Determinism is indispensable for reproducibility in large language model (LLM) training, yet it often exacts a steep performance cost. In widely used attention implementations such as FlashAttention-3, the deterministic backward pass can incur up to a 37.9% throughput reduction relative to its non-deterministic counterpart, primarily because gradient accumulation operations must be serialized to guarantee numerical consistency. This performance loss stems from suboptimal scheduling of compute and gradient-reduction phases, leading to significant hardware underutilization. To address this challenge, we formulate the backward pass of deterministic attention as a scheduling problem on a Directed Acyclic Graph (DAG) and derive schedules that minimize the critical path length. Building on this formulation, we present DASH (Deterministic Attention Scheduling for High-Throughput), which encapsulates two complementary scheduling strategies: (i) Descending Q-Tile Iteration, a reversed query-block traversal that shrinks pipeline stalls in causal attention, and (ii) Shift Scheduling, a theoretically optimal schedule within our DAG model that reduces pipeline stalls for both full and causal masks. Our empirical evaluations on NVIDIA H800 GPUs demonstrate that DASH narrows the performance gap of deterministic attention. The proposed strategies improve the throughput of the attention backward pass by up to 1.28$\times$ compared to the baseline, significantly advancing the efficiency of reproducible LLM training. Our code is open-sourced at https://github.com/SJTU-Liquid/deterministic-FA3.
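The stall-reduction effect of reversed query-tile traversal can be illustrated with a toy critical-path model. The sketch below is a simplifying assumption for illustration, not the paper's exact DAG or the FlashAttention-3 kernel: one worker per key tile `j` processes query tiles `i >= j` under a causal mask, each `(i, j)` step costs one time unit, and deterministic accumulation of partial `dQ[i]` gradients forces step `(i, j)` to finish after step `(i, j-1)` (a fixed ascending key-tile reduction order). The function name and unit costs are hypothetical.

```python
def backward_makespan(num_tiles, descending):
    """Toy critical-path model of a deterministic causal-attention
    backward pass (an illustrative sketch, not the actual FA3 kernel).

    One worker per key tile j walks its query tiles sequentially; the
    serialized gradient reduction makes step (i, j) wait for (i, j-1).
    Returns the makespan (longest finish time over all steps).
    """
    finish = {}  # (i, j) -> finish time of that step
    for j in range(num_tiles):
        # Workers run concurrently; cross-worker stalls are captured
        # purely through the (i, j-1) dependency below.
        if descending:
            tiles = range(num_tiles - 1, j - 1, -1)  # reversed traversal
        else:
            tiles = range(j, num_tiles)              # ascending traversal
        prev_step = 0
        for i in tiles:
            dep = finish.get((i, j - 1), 0)  # fixed dQ[i] reduction order
            prev_step = max(prev_step, dep) + 1
            finish[(i, j)] = prev_step
    return max(finish.values())
```

In this model, with 8 tiles the ascending traversal yields a makespan of 15 while the descending traversal yields 8 — the minimum possible, since the `j = 0` worker alone has 8 steps. This mirrors the intuition behind Descending Q-Tile Iteration: reversing the traversal aligns each worker's arrival at a query tile with the fixed reduction order, so serialization overlaps with compute instead of stacking after it.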