Modern data center applications increasingly demand microsecond-scale service times under strict tail latency requirements, which existing in-network task schedulers cannot readily deliver because of their inherent limitations. Specifically, software-based schedulers struggle to balance throughput and latency, while switch-based designs lack global coordination, rely heavily on packet recirculation, or offer only limited support for large tasks. To address these limitations of the state of the art (SOTA), we propose Rain, an RDMA-assisted in-network scheduler built atop programmable switches that maintains centralized queues while bounding worker-local queues. Rain introduces a bidirectional on-switch queuing mechanism that buffers tasks and worker-issued tokens and matches them directly in the switch, avoiding worker-side polling and approximating the optimal behavior of join-bounded-shortest-queue without global aggregation. A switch-driven RDMA engine pre-writes arbitrarily large tasks via one-sided WRITE multicasts, keeping only compact metadata on the switch. Slice-aware scheduling further localizes decisions to more homogeneous queues, reducing dispersion-induced head-of-line blocking. Moreover, our study reveals that real-world systems can diverge from theoretical predictions: shallower worker queues do not always improve tail latency. Leveraging this insight, Rain incorporates an adaptive scheduling strategy that optimizes worker queue depths and worker-to-slice mappings at runtime. Evaluations with the real-world application RocksDB show that Rain achieves 1.75x higher throughput than the best-performing SOTA while satisfying the same tail latency requirement.
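The bidirectional queuing idea described above can be illustrated with a minimal software sketch. It is a toy model, not Rain's on-switch implementation: the class and method names (`BidirectionalQueue`, `on_task`, `on_token`) are hypothetical, and it assumes that each token stands for one free slot in a worker's bounded local queue. Whichever side arrives first is buffered, and an arrival from the other side is matched with it immediately, so workers never need to poll.

```python
from collections import deque


class BidirectionalQueue:
    """Toy model of a queue that buffers either tasks or worker tokens.

    Hypothetical illustration only: in this sketch a token is assumed to
    represent one free slot in a worker's bounded local queue.
    """

    def __init__(self):
        self.tasks = deque()   # tasks waiting for a free worker slot
        self.tokens = deque()  # worker ids that advertised a free slot

    def on_task(self, task):
        """A task arrives: dispatch it if a token is waiting, else buffer it."""
        if self.tokens:
            return (self.tokens.popleft(), task)
        self.tasks.append(task)
        return None

    def on_token(self, worker):
        """A token arrives: hand over the oldest buffered task, else buffer the token."""
        if self.tasks:
            return (worker, self.tasks.popleft())
        self.tokens.append(worker)
        return None


if __name__ == "__main__":
    q = BidirectionalQueue()
    print(q.on_token("w0"))  # None: no task yet, token is buffered
    print(q.on_task("t0"))   # ('w0', 't0'): matched with the buffered token
    print(q.on_task("t1"))   # None: no token available, task is buffered
    print(q.on_token("w1"))  # ('w1', 't1'): matched with the buffered task
```

In this model at most one of the two deques is non-empty at any time, which is why a single structure can serve both directions. Bounding the number of tokens each worker may have outstanding would bound its local queue depth.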