Compute Express Link (CXL) 3.0 and beyond allow the compute nodes of a cluster to share data with hardware cache coherence at the granularity of a cache line. This enables shared-memory semantics for distributed computing, but introduces new resilience challenges: a node failure loses the dirty data in that node's caches, corrupting application state. Unfortunately, the CXL specification does not consider processor failures. Moreover, when a component fails, the specification only tries to isolate it and continue application execution; there is no attempt to bring the application to a consistent state. To address these limitations, this paper extends the CXL specification to be resilient to node failures and to correctly recover the application after node failures. We call the system ReCXL. To handle node failures, ReCXL augments the coherence transaction of a write with messages that propagate the update to a small set of other nodes (i.e., replicas). Each replica saves the update in a hardware Logging Unit. This replication ensures resilience to node failures. Then, at regular intervals, the Logging Units dump the updates to memory. Recovery uses the logs in the Logging Units to bring the directory and memory to a correct state. Our evaluation shows that ReCXL enables fault-tolerant execution with only a 30% slowdown over the same platform with no fault-tolerance support.
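To make the replication-and-replay idea concrete, below is a minimal software sketch of the write path and recovery described above. It is an illustrative model only: the names (`Cluster`, `LoggingUnit`, `REPLICAS`) and the replica-selection policy are assumptions for the example, and in ReCXL the mechanism lives in coherence hardware, not software.

```python
# Illustrative software model of replicated write logging and recovery.
# All class and variable names are hypothetical; the real mechanism is
# implemented in the coherence hardware, not in software.

REPLICAS = 2  # number of extra nodes that log each update (assumed policy)

class LoggingUnit:
    """Per-node log, modeled as a list of (addr, value) entries."""
    def __init__(self):
        self.entries = []

    def append(self, addr, value):
        self.entries.append((addr, value))

    def flush(self, memory):
        # Periodic dump of logged updates to memory.
        for addr, value in self.entries:
            memory[addr] = value
        self.entries.clear()

class Cluster:
    def __init__(self, num_nodes):
        self.memory = {}  # shared memory, keyed by address
        self.logs = [LoggingUnit() for _ in range(num_nodes)]
        self.alive = [True] * num_nodes

    def write(self, writer, addr, value):
        # The coherence transaction of a write also propagates the
        # update to a small set of other nodes (here: the next
        # REPLICAS nodes, ring-style), which record it in their logs.
        n = len(self.logs)
        targets = [(writer + i) % n for i in range(1, REPLICAS + 1)]
        for t in targets:
            if self.alive[t]:
                self.logs[t].append(addr, value)

    def fail(self, node):
        self.alive[node] = False  # dirty data cached on this node is lost

    def recover(self):
        # Replay surviving logs to bring memory to a correct state.
        for node, log in enumerate(self.logs):
            if self.alive[node]:
                log.flush(self.memory)
        return self.memory
```

With two replicas, a single node failure cannot lose an update: the writer's dirty line may vanish with its caches, but at least one surviving Logging Unit still holds the logged value and replays it during recovery.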