Cohet: A CXL-Driven Coherent Heterogeneous Computing Framework with Hardware-Calibrated Full-System Simulation

Conventional heterogeneous computing systems built on PCIe interconnects suffer from inefficient fine-grained host-device interactions and complex programming models. In recent years, many proprietary and open cache-coherent interconnect standards have emerged, among which Compute Express Link (CXL) has come to dominate the open-standard domain after absorbing several competing solutions. Although CXL-based coherent heterogeneous computing has the potential to fundamentally transform how CPUs and XPUs collaborate, research in this direction remains hampered by the scarcity of CXL-capable platforms, immature software/hardware ecosystems, and unclear application prospects. This paper presents Cohet, the first CXL-driven coherent heterogeneous computing framework. Cohet decouples compute and memory resources to form unbiased CPU and XPU pools that share a single unified, coherent memory pool. It exposes a standard malloc/mmap interface to both CPU and XPU compute threads, leaving the OS to handle smart memory allocation and the management of heterogeneous resources. To facilitate Cohet research, we also present SimCXL, a full-system cycle-level simulator capable of modeling all CXL sub-protocols and device types. SimCXL has been rigorously calibrated against a real CXL testbed with various CXL memory devices and accelerators, showing an average simulation error of 3%. Our evaluation reveals that CXL.cache reduces latency by 68% and increases bandwidth by 14.4x compared to DMA transfers at cacheline granularity. Building upon these insights, we demonstrate the benefits of Cohet with two killer applications: remote atomic operations (RAO) and remote procedure calls (RPC). Compared to a PCIe-NIC design, the CXL-NIC achieves a 5.5x to 40.2x speedup for RAO offloading and an average speedup of 1.86x for RPC (de)serialization offloading.
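As a rough illustration of the programming model described above, the following sketch shows two compute threads updating a counter in one shared mapping without any explicit copy or DMA step. It is only an analogy under stated assumptions: both threads run on the host, an anonymous `mmap` region stands in for the unified coherent memory pool, and a software lock stands in for the atomicity that coherent hardware would provide. The names `make_shared_pool`, `fetch_add`, `worker`, and `run` are hypothetical and do not come from Cohet or SimCXL.

```python
import mmap
import struct
import threading


def make_shared_pool(size=4096):
    """Anonymous mapping standing in for the unified coherent memory pool."""
    return mmap.mmap(-1, size)


def fetch_add(pool, lock, offset, delta):
    """Emulated atomic fetch-and-add on a 64-bit counter at `offset`.

    Assumption: the lock is a software stand-in for hardware-provided
    atomicity; it is not how a CXL device would implement the operation.
    """
    with lock:
        (old,) = struct.unpack_from("<Q", pool, offset)
        struct.pack_into("<Q", pool, offset, old + delta)
        return old


def worker(pool, lock, offset, iterations):
    """Each compute thread updates the same location in place."""
    for _ in range(iterations):
        fetch_add(pool, lock, offset, 1)


def run(iterations=1000):
    """Run a 'CPU' thread and a stand-in 'XPU' thread on one shared buffer."""
    pool = make_shared_pool()
    lock = threading.Lock()
    cpu = threading.Thread(target=worker, args=(pool, lock, 0, iterations))
    xpu = threading.Thread(target=worker, args=(pool, lock, 0, iterations))
    cpu.start()
    xpu.start()
    cpu.join()
    xpu.join()
    (total,) = struct.unpack_from("<Q", pool, 0)
    pool.close()
    return total
```

Calling `run(n)` returns the final counter value, which equals `2 * n` because both threads apply their increments to the same memory location rather than to private copies.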
