Large-scale AI/ML training systems depend on two assumptions that are rarely examined: (1) that checkpoints represent atomic snapshots of global training state, and (2) that infrastructure updates can be applied without inducing mixed-protocol cluster states. Both assumptions are instances of a deeper structural error: the Forward-In-Time-Only (FITO) category mistake, which confuses protocol convergence properties with temporal predicates. We formalize this confusion as a type error: the identification of a temporal snapshot $\mathsf{Snap}(t)$ with a convergence property $\mathsf{Conv}(\mathcal{P},e)$. We model checkpoint execution in a process-algebraic framework and prove that under asynchronous composition with crash-recovery failures, no temporal instant can serve as an atomicity boundary. We reformulate checkpoint inconsistency on an epoch lattice and show that atomicity is a measure-zero event whose complement grows exponentially with the number of independent persistence domains. We formalize mixed-epoch recovery as a type violation in the optimization algebra and show that the resulting update is not a valid step of any standard optimizer. For firmware fleet updates, we strengthen the known consensus-hardness result: atomic deployment requires not merely agreement but common knowledge of the epoch transition, which is strictly unattainable in asynchronous systems with unreliable communication. We conclude by sketching a bilateral convergence protocol, inspired by Open Atomic Ethernet, that achieves $\mathsf{Conv}(\mathcal{P},e)$ without requiring $\mathsf{Snap}(t)$, thereby replacing the FITO assumption with constraint semantics.
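The distinction between a temporal snapshot and a convergence property can be illustrated with a small Python sketch. It is not the paper's formalism. The domain names, the `converged` predicate, and the independent-commit model with per-domain probability `p_commit` are all assumptions introduced here for illustration. The paper's measure-theoretic argument on the epoch lattice is not reproduced.

```python
from typing import Mapping


def converged(epochs: Mapping[str, int], e: int) -> bool:
    """Hypothetical stand-in for Conv(P, e): true when every persistence
    domain has durably committed epoch e. No wall-clock time is consulted."""
    return all(committed == e for committed in epochs.values())


def mixed_epoch_probability(n_domains: int, p_commit: float) -> float:
    """Assumed toy model: each of n_domains commits the target epoch
    independently with probability p_commit. Returns the chance that at
    least one domain is left on a different epoch."""
    return 1.0 - p_commit ** n_domains


# Hypothetical persistence domains observed at recovery time.
domains = {"model_shard_0": 7, "optimizer_shard_0": 7, "dataloader_state": 6}

print(converged(domains, 7))  # False: dataloader_state is still at epoch 6

# Under the toy model, the all-domains-agree probability p_commit ** n
# shrinks geometrically as the number of domains grows.
for n in (1, 8, 64, 512):
    print(n, round(mixed_epoch_probability(n, 0.99), 3))
```

In this sketch the recovery check is a predicate over the vector of committed epochs and takes no timestamp, which is the sense in which $\mathsf{Conv}(\mathcal{P},e)$ can be evaluated without appealing to $\mathsf{Snap}(t)$.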