The Ghost in the Datacenter: Link Flapping, Topology Knowledge Failures, and the FITO Category Mistake

Every link disconnection or flap in a datacenter corrupts the network's self-knowledge: its topology graph. We call this corruption a ghost: a node that appears reachable but is not, a link that reports "up" but silently drops traffic, or an IP address that resolves to a partitioned machine. Ghosts arise at every scale -- chiplet-to-chiplet (PCIe, UCIe), GPU-to-GPU (NVLink, NVSwitch), node-to-node (Ethernet, Thunderbolt), and cluster-to-cluster (IP, BGP) -- because all these protocols inherit Shannon's forward-in-time-only (FITO) channel model and use Timeout And Retry (TAR) as their failure detector. TAR cannot distinguish "slow" from "dead," which is precisely the ambiguity that Fischer--Lynch--Paterson proved unresolvable in asynchronous systems. We survey the problem using production data from Meta (419 interruptions in 54 days of Llama 3 training), ByteDance (38,236 explicit and 5,948 implicit failures in three months), Google (TPUv4 optical circuit switching), and Alibaba (0.057% NIC--ToR link failures per month). At 2025 cluster scale (${\sim}3$ million GPUs, ${>}10$ million optical links), a link flap occurs every 48 seconds. We show that every existing mitigation -- Phi Accrual failure detectors, SWIM, BFD, OSPF/IS-IS fast convergence, SmartNIC offload, lossless Ethernet (RoCE/PFC), and Kubernetes pod eviction -- still creates ghosts because each is fundamentally timeout-based. We connect ghosts to gray failures (Huang et al., HotOS 2017) and metastable failures (Bronson et al., HotOS 2021; validated across 22 failures at 11 organizations, OSDI 2022). We argue that Open Atomic Ethernet eliminates ghosts at the link layer through a Reliable Link Failure Detector, Perfect Information Feedback, triangle failover, and atomic token transfer, making topology knowledge transactional.
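The claim that a timeout cannot tell "slow" from "dead" can be illustrated with a small deterministic simulation. The sketch below is not taken from any of the systems named above. The class `TimeoutDetector`, the helper `observe`, and all heartbeat timings are hypothetical, chosen only to show that a dead peer and a delayed peer present identical evidence to a timeout-based observer.

```python
from dataclasses import dataclass


@dataclass
class TimeoutDetector:
    """Hypothetical timeout-based failure detector (illustrative only)."""
    timeout: float            # seconds of silence tolerated before declaring "dead"
    last_heard: float = 0.0   # arrival time of the most recent heartbeat

    def heartbeat(self, now: float) -> None:
        self.last_heard = now

    def believes_alive(self, now: float) -> bool:
        return (now - self.last_heard) <= self.timeout


def observe(arrivals, timeout, probe_times):
    """Replay heartbeat arrival times; return the detector's belief at each probe time."""
    det = TimeoutDetector(timeout=timeout)
    pending = sorted(arrivals)
    beliefs = {}
    for t in sorted(probe_times):
        # Deliver every heartbeat that has arrived by time t.
        while pending and pending[0] <= t:
            det.heartbeat(pending.pop(0))
        beliefs[t] = det.believes_alive(t)
    return beliefs


# Assumed scenario 1: the peer crashes right after its heartbeat at t=1.0.
dead = observe([0.0, 0.5, 1.0], timeout=1.0, probe_times=[1.5, 2.5])

# Assumed scenario 2: the peer is healthy, but congestion delays its next
# heartbeat until t=2.6.
slow = observe([0.0, 0.5, 1.0, 2.6], timeout=1.0, probe_times=[1.5, 2.5, 3.0])

# At t=1.5 the crashed peer is still believed alive (a ghost), and at t=2.5
# the healthy peer is declared dead. At both times the two scenarios are
# indistinguishable to the detector.
print(dead)
print(slow)
```

Shortening the timeout shrinks the window in which the crashed peer is still believed alive, but it makes the false "dead" verdict for the slow peer more likely. Lengthening it does the reverse. No timeout setting removes both errors.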
