While Compute Express Link (CXL) enables cache-coherent shared memory across multiple nodes, it also introduces new types of failures: processes can fail before data does, or data can fail before a process does. The lack of a failure model for CXL-based shared memory makes these failures difficult to understand and mitigate. To address this gap, we describe a failure model for CXL-based shared memory that divides failures into two categories, data failures and process failures, and discusses how to handle each. Data failures in CXL-based shared memory render data inaccessible or inconsistent for a currently running application. We argue that such failures differ from data failures in distributed storage systems and require CXL-specific handling. We therefore examine traditional data failure mitigation techniques, such as erasure coding and replication, and propose new solutions that better handle data failures in CXL-based shared memory systems. Next, we examine process failures and compare both the failures and their potential solutions with the failure model and programming solutions of persistent memory (PMEM). We argue that although PMEM shares some of CXL's characteristics, its solutions do not fully account for CXL's volatile nature and low access latencies. Finally, taking inspiration from PMEM programming solutions, we propose techniques to handle these new failures. This paper is thus the first work to define a failure model for CXL-based shared memory and to propose tailored solutions that address challenges specific to CXL-based systems.