Although modern, AI-centric datacenters rely heavily on SmartNICs, existing devices impose a hard trade-off. Commercial SmartNICs provide high bandwidth and easy software integration, but offer limited support for customization and data processing offload. In contrast, research SmartNICs often suffer from low bandwidth, limited functionality, and poor software compatibility, to the point that many are not NICs in the technical sense. This gap can be closed by treating the NIC datapath as a first-class stream computation substrate, with shared hardware/software abstractions that enable tight co-design of infrastructure and applications. To demonstrate this, we introduce SCENIC, an open-source datacenter SmartNIC. SCENIC implements a 200G network datapath over offloaded TCP/IP and RDMA stacks, as well as a fallback path for processing arbitrary network traffic. On top of the network logic, SCENIC combines on-datapath Stream Compute Units (SCUs) for data processing, embedded ARM cores for flexible control-path manipulation, and direct access to GPUs and SSDs. SCENIC is fully integrated with the OS and exposes native Linux network and RDMA verb interfaces, which makes the programmable datapath transparent to existing applications while still allowing control of features such as user-defined offloads and programmable congestion control. SCENIC's performance matches that of commercial platforms, and we show its versatility through several use cases, including offloaded collective communication and network-to-GPU hash-based data partitioning.