The use of disaggregated or far memory systems such as CXL memory pools has renewed interest in Near-Data Processing (NDP): situating cores close to memory to reduce bandwidth requirements to and from the CPU. Hardware designs for such accelerators are appearing, but clean, portable OS abstractions for programming them are lacking. We propose a programming model for NDP devices based on familiar OS abstractions: virtual processors (processes) and inter-process communication channels (like Unix pipes). While appealing from a user perspective, a naive implementation of such abstractions is inappropriate for NDP accelerators: the paucity of processing power in some hardware designs makes classical processes overly heavyweight, and IPC based on shared buffers makes no sense in a system designed to reduce memory bandwidth. Accordingly, we show how to implement these abstractions in a lightweight and efficient manner by exploiting compilation and interconnect protocols. We demonstrate them on a real hardware platform running applications with a range of memory access patterns, including bulk memory operations, in-memory databases, and graph applications. Crucially, we show not only the benefits over CPU-only implementations, but also the critical importance of efficient, low-latency communication channels between the CPU and NDP accelerators, a feature largely neglected in existing proposals.