We propose MatchRDMA, a proactive, segmented, and rate-matched long-haul RDMA scheme for geo-distributed LLM training over OTN. By coordinating source and destination OTN rates, it improves inter-DC throughput by up to 20x compared with conventional RDMA, and reduces destination-OTN buffer occupancy by up to 62.7%.