While Compute Express Link (CXL) enables cache-coherent shared memory among multiple nodes, it also introduces new types of failures: a process can fail before data does, or data can fail before a process does. The lack of a failure model for CXL-based shared memory makes these failures difficult to understand and mitigate. To address this gap, we describe a model that categorizes failures in CXL-based shared memory as either data failures or process failures, and we discuss how to handle each. Data failures in CXL-based shared memory render data inaccessible or inconsistent for a currently running application. We argue that such failures differ from data failures in distributed storage systems and require CXL-specific handling. We therefore examine traditional data failure mitigation techniques, such as erasure coding and replication, and propose new solutions to better handle data failures in CXL-based shared memory systems. Next, we examine process failures and compare them, along with their potential solutions, with the failure model and programming solutions of persistent memory (PMEM). We argue that although PMEM shares some of CXL's characteristics, its failure model and programming solutions do not fully account for CXL's volatile nature and low access latencies. Finally, taking inspiration from PMEM programming solutions, we propose techniques to handle these new failures. This paper is thus the first work to define the CXL-based shared memory failure model and to propose tailored solutions for challenges specific to CXL-based systems.