RDMA one-sided verbs are the natural primitive for memory disaggregation, but they require the client to supply the exact remote address. Their single-round-trip (1-RTT) performance breaks down when the target address depends on data that must first be read from remote memory, a pattern we call the Indirection Wall. Indirection is pervasive: graph traversals follow pointers hop by hop, address translation walks multi-level page tables, distributed coordination requires conditional multi-host logic, and disaggregated LLM inference must resolve paged KV caches through block-table lookups. Each level of indirection costs one sequentially dependent network round-trip, yet offloading to existing RDMA NICs either consumes remote CPU cycles or has limited throughput. We present Tiara, a compact, statically verifiable instruction set that executes on the memory-side NIC. Tiara operators are pre-registered programs, analogous to eBPF programs in the kernel, that resolve indirection locally, collapsing multi-RTT dependent chains into a single round-trip. On an FPGA-based prototype, Tiara reduces 10-hop graph-traversal latency by 2.85x over one-sided RDMA while sustaining 3.4x higher throughput, cuts page-table walk latency by 62%, reduces uncontended distributed-lock latency by 2.9x, achieves 2.8x the throughput for disaggregated PagedAttention at 8 KB blocks, and improves MoE expert-gather latency by 1.88x at 32 experts.
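
The round-trip arithmetic behind the Indirection Wall can be illustrated with a toy model. The sketch below is not Tiara's instruction set or any real RDMA API: `RemoteMemory`, `run_operator`, `chase_client_side`, `chase_operator`, and `build_chain` are hypothetical names, and the "network" is only a counter. It shows that a client chasing pointers with one-sided reads pays one round-trip per hop, while a pre-registered memory-side routine resolves the same chain in one.

```python
from dataclasses import dataclass, field

NIL = 0  # null-pointer sentinel (assumption for this toy model)


@dataclass
class RemoteMemory:
    """Toy memory node: address -> stored word, plus a round-trip counter."""
    cells: dict = field(default_factory=dict)
    round_trips: int = 0

    def read(self, addr):
        """Models a one-sided READ: every call costs one network round-trip."""
        self.round_trips += 1
        return self.cells[addr]

    def run_operator(self, op, *args):
        """Models a hypothetical memory-side operator call: one round-trip in
        total, with all loads inside `op` served from local memory."""
        self.round_trips += 1
        return op(self.cells, *args)


def chase_client_side(mem, head, hops):
    """Client-driven traversal: each hop needs the previous READ's result."""
    addr = head
    for _ in range(hops):
        addr = mem.read(addr)
        if addr == NIL:
            break
    return addr


def chase_operator(cells, head, hops):
    """Same traversal, written as a routine that runs next to the memory."""
    addr = head
    for _ in range(hops):
        addr = cells[addr]
        if addr == NIL:
            break
    return addr


def build_chain(length):
    """Linked chain: node i sits at address 8*(i+1) and stores the next address."""
    cells = {8 * (i + 1): 8 * (i + 2) for i in range(length)}
    cells[8 * (length + 1)] = NIL  # terminate the chain
    return cells


if __name__ == "__main__":
    one_sided = RemoteMemory(build_chain(10))
    offloaded = RemoteMemory(build_chain(10))
    assert chase_client_side(one_sided, 8, 10) == offloaded.run_operator(chase_operator, 8, 10)
    print("one-sided round-trips:", one_sided.round_trips)
    print("operator round-trips:", offloaded.round_trips)
```

The model counts round-trips only. It ignores execution time on the NIC, memory access latency, and queuing, so the round-trip ratio it produces is not a latency prediction and should not be compared directly with the measured speedups reported above.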