Every link disconnection or flap in a datacenter corrupts the network's self-knowledge -- its graph. We call this corruption a ghost: a node that appears reachable but is not, a link that reports "up" but silently drops traffic, or an IP address that resolves to a partitioned machine. Ghosts arise at every scale -- chiplet-to-chiplet (PCIe, UCIe), GPU-to-GPU (NVLink, NVSwitch), node-to-node (Ethernet, Thunderbolt), and cluster-to-cluster (IP, BGP) -- because all of these protocols inherit Shannon's forward-in-time-only (FITO) channel model and use Timeout And Retry (TAR) as their failure detector. TAR cannot distinguish "slow" from "dead," which is precisely the ambiguity that Fischer, Lynch, and Paterson proved unresolvable in asynchronous systems. We survey the problem using production data from Meta (419 interruptions in 54 days of LLaMA 3 training), ByteDance (38,236 explicit and 5,948 implicit failures in three months), Google (TPUv4 optical circuit switching), and Alibaba (a 0.057% monthly NIC--ToR link failure rate). At 2025 cluster scale (${\sim}3$ million GPUs, ${>}10$ million optical links), a link flap occurs on average every 48 seconds. We show that every existing mitigation -- Phi Accrual failure detectors, SWIM, BFD, OSPF/ISIS fast convergence, SmartNIC offload, lossless Ethernet (RoCE/PFC), and Kubernetes pod eviction -- still creates ghosts because each is fundamentally timeout-based. We connect ghosts to gray failures (Huang et al., HotOS 2017) and metastable failures (Bronson et al., HotOS 2021; validated across 22 failures at 11 organizations, OSDI 2022). We argue that Open Atomic Ethernet eliminates ghosts at the link layer through a Reliable Link Failure Detector, Perfect Information Feedback, triangle failover, and atomic token transfer, which together make topology knowledge transactional.
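The claim that TAR cannot distinguish "slow" from "dead" can be illustrated with a minimal sketch. The names `Peer` and `tar_verdict` and the timeout and retry values are hypothetical, chosen only for illustration; they are not drawn from any of the protocols surveyed. The model is deliberately simplified: each peer either never replies or always replies after a fixed delay.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Peer:
    """Hypothetical peer model (an assumption for illustration only)."""
    name: str
    reply_delay_s: Optional[float]  # None models a peer that never replies


def tar_verdict(peer: Peer, timeout_s: float, retries: int) -> str:
    """Timeout-and-retry probe: each attempt waits at most timeout_s."""
    for _ in range(retries + 1):
        replied = (
            peer.reply_delay_s is not None
            and peer.reply_delay_s <= timeout_s
        )
        if replied:
            return "alive"
    # The detector only observes silence; it cannot tell why there was none.
    return "suspected"


slow = Peer("slow-but-alive", reply_delay_s=1.5)
dead = Peer("partitioned", reply_delay_s=None)
fast = Peer("healthy", reply_delay_s=0.01)

verdicts = {
    p.name: tar_verdict(p, timeout_s=1.0, retries=3)
    for p in (slow, dead, fast)
}
print(verdicts)
```

With a 1.0 s timeout, the slow peer and the partitioned peer receive the same verdict, so the detector's output carries no information about which case occurred. Raising the timeout to 2.0 s rescues the slow peer but lengthens the window in which the partitioned peer is still treated as reachable, which is the trade-off that lets ghosts persist under any timeout setting.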