The emerging microservice- and serverless-based cloud programming paradigm, combined with rising network speeds, leaves the RPC stack as the predominant data center tax. Domain-specific hardware acceleration has the potential to offload this overhead and save host CPU cycles. However, state-of-the-art RPC accelerators either integrate RPC logic into the CPU or rely on specialized low-latency interconnects, neither of which is widely adopted in commodity servers. To this end, we design and implement RPCAcc, a software-hardware co-designed on-NIC RPC accelerator that enables reconfigurable RPC kernel offloading. RPCAcc connects to the server through PCIe, the most widely used interconnect. To address the challenges that PCIe introduces, RPCAcc proposes three techniques: (a) a target-aware deserializer that effectively batches cross-PCIe writes in the accelerator's on-chip memory using compacted hardware data structures; (b) a memory-affinity CPU-accelerator collaborative serializer, which accepts additional host memory copies in exchange for avoiding slow cross-PCIe transfers; and (c) an automatic field update technique that transparently codifies the schema based on dynamically reconfigurable RPC kernels to minimize superfluous PCIe traversals. We prototype RPCAcc on the Xilinx U280 FPGA card. On HyperProtoBench, RPCAcc achieves 3.2× lower serialization time than a comparable RPC accelerator baseline and demonstrates up to 2.6× higher throughput in an end-to-end cloud workload.
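The idea of batching cross-PCIe writes in on-chip memory, as in technique (a), can be illustrated with a small software model. This is only a sketch under stated assumptions, not RPCAcc's hardware design: the `BatchingWriter` class, the `deserialize_fields` helper, and the `capacity` parameter are hypothetical names introduced here, and each entry in `flushes` stands in for one cross-PCIe write.

```python
from dataclasses import dataclass, field


@dataclass
class BatchingWriter:
    """Toy model: stage small writes locally, then emit one large write.

    Hypothetical illustration only; it does not model RPCAcc's actual
    hardware data structures.
    """
    capacity: int  # staging buffer size in bytes (assumed parameter)
    staged: bytearray = field(default_factory=bytearray)
    flushes: list = field(default_factory=list)  # one entry per modeled PCIe write

    def write(self, payload: bytes) -> None:
        # Flush first if the new payload would overflow the staging buffer.
        if self.staged and len(self.staged) + len(payload) > self.capacity:
            self.flush()
        self.staged.extend(payload)
        # Flush once the buffer is full. A payload larger than the buffer
        # is emitted on its own as a single oversized write.
        if len(self.staged) >= self.capacity:
            self.flush()

    def flush(self) -> None:
        # Emit everything staged so far as one modeled cross-PCIe write.
        if self.staged:
            self.flushes.append(bytes(self.staged))
            self.staged.clear()


def deserialize_fields(fields, capacity):
    """Write each decoded field through the batching writer, in order."""
    writer = BatchingWriter(capacity)
    for payload in fields:
        writer.write(payload)
    writer.flush()  # drain whatever is still staged
    return writer.flushes


if __name__ == "__main__":
    writes = deserialize_fields([b"ab", b"cde", b"f", b"ghij"], capacity=4)
    print(len(writes), "batched writes for 4 fields")
```

In this model, four field writes collapse into three larger transfers while the byte order is preserved. The benefit in practice would depend on field sizes and the buffer capacity.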