RDMA link failures can render connections temporarily unavailable, causing both performance degradation and significant recovery overhead. To tolerate such failures, production datacenters pair each primary link with a standby link and, upon failure, uniformly retransmit all in-flight RDMA requests over the backup path. However, we observe that such blanket retransmission is unnecessary. In-flight requests fall into two categories depending on whether the responder had already executed them when the link failed: pre-failure requests, which were lost before execution, and post-failure requests, which were executed but whose completions never reached the requester. Retransmitting post-failure requests is not only redundant (it consumes bandwidth) but also incorrect for non-idempotent operations, where duplicate execution can violate application semantics. We present Varuna, a failure-type-aware RDMA recovery mechanism that enables correct retransmission and microsecond-level failover. Varuna piggybacks a lightweight completion log on every RDMA operation; after a link failure, this log deterministically reveals which in-flight requests were executed (post-failure) and which were lost (pre-failure). Varuna then retransmits only the pre-failure subset and recovers the return values of the post-failure requests. Evaluated using synthetic microbenchmarks and end-to-end RDMA TPC-C transactions, Varuna adds only 0.6-10% steady-state latency overhead in realistic applications, eliminates 65% of recovery retransmission time, preserves transactional consistency, and incurs no connection-rebuild overhead and negligible memory overhead during RDMA failover.
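The recovery rule described in the abstract can be illustrated with a toy model. This is a minimal sketch, not Varuna's implementation: the real mechanism operates on RDMA operations at the NIC level, while here the completion log is assumed to be a plain dictionary mapping a request sequence number to the value the responder returned. The names `Request`, `classify_in_flight`, `recover`, and the `resend` callback are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Request:
    """Hypothetical in-flight request; `seq` identifies it in the log."""
    seq: int
    op: str  # e.g. "WRITE" or "FETCH_ADD" (the latter is non-idempotent)


def classify_in_flight(
    in_flight: List[Request],
    completion_log: Dict[int, Any],
) -> Tuple[List[Request], List[Tuple[Request, Any]]]:
    """Split in-flight requests using the (assumed) completion log.

    A request whose sequence number appears in the log was executed by
    the responder (post-failure); otherwise it was lost (pre-failure).
    """
    pre_failure: List[Request] = []
    post_failure: List[Tuple[Request, Any]] = []
    for req in in_flight:
        if req.seq in completion_log:
            post_failure.append((req, completion_log[req.seq]))
        else:
            pre_failure.append(req)
    return pre_failure, post_failure


def recover(
    in_flight: List[Request],
    completion_log: Dict[int, Any],
    resend: Callable[[Request], Any],
) -> Dict[int, Any]:
    """Retransmit only lost requests; reuse logged results for the rest."""
    pre_failure, post_failure = classify_in_flight(in_flight, completion_log)
    results: Dict[int, Any] = {}
    for req, value in post_failure:
        results[req.seq] = value  # already executed: do not run it again
    for req in pre_failure:
        results[req.seq] = resend(req)  # never executed: safe to retransmit
    return results
```

In this model, a `FETCH_ADD` that the responder had already applied is never sent a second time, which is the property that blanket retransmission violates for non-idempotent operations.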