Modern microservices increasingly depend on high-performance remote procedure calls (RPCs) to coordinate fine-grained, distributed computation. As network bandwidths continue to scale, the CPU overhead of RPC processing, particularly serialization, deserialization, and protocol handling, has become a critical bottleneck. Fast user-space networking stacks such as DPDK exacerbate this challenge by exposing RPC processing as the dominant performance limiter. While prior work has explored software optimizations and FPGA-based offload engines, these approaches remain physically distant from the CPU's memory hierarchy, incurring unnecessary data movement and cache pollution. We present Arcalis, a near-cache RPC accelerator that places a lightweight hardware engine adjacent to the last-level cache (LLC). Arcalis offloads RPC processing to dedicated microengines on the receive and transmit paths that operate with cache-line latency while preserving programmability. By decoupling RPC processing logic, enabling microservice-specific execution, and sitting near the LLC so that it can immediately consume data injected by network cards, Arcalis achieves a 1.79-4.16$\times$ end-to-end speedup over the CPU baseline, reduces microarchitectural overhead by up to 88%, and delivers up to 1.62$\times$ higher throughput than prior solutions. These results highlight the potential of near-cache RPC acceleration as a practical solution for high-performance microservice deployment.