University campuses host abundant but fragmented GPU resources, yet voluntary sharing is blocked by a mismatch: owners retain autonomous, revocable control over their machines, while existing migration mechanisms assume stationary failure hazards, homogeneous interconnects, and unbounded transfer windows. We present ReclaimNet, a network-layer migration protocol suite that treats provider reclaim as a first-class contract rather than a failure case. It combines three mechanisms: (i) reclaim-aware checkpoint scheduling that jointly adapts to time-varying departure hazards and contended bandwidth across co-resident jobs; (ii) volatility-aware destination selection that integrates topology, survival probability, and notice-window feasibility; and (iii) deadline-aware migration traffic control with edge enforcement and a sub-millisecond TC BPF kill-switch. In a two-month deployment on a 54-node heterogeneous campus testbed, ReclaimNet reduces work loss by 66% relative to Slurm preempt-and-requeue and by 38% relative to pipeline-redundancy checkpointing, with 38% shorter downtime and under 3% degradation of background research traffic. The prototype is open-sourced at the anonymous repository https://anonymous.4open.science/r/ICNP2026-ReclaimNet/.
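To make mechanism (ii) concrete, the following is a minimal illustrative sketch of how topology, survival probability, and notice-window feasibility might be combined when choosing a migration destination. It is not the ReclaimNet implementation: the `Candidate` fields, the `pick_destination` function, the linear `hop_penalty` scoring rule, and all numeric values are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical description of a candidate destination node."""
    name: str
    hops: int               # topology distance from the source node
    bandwidth_gbps: float   # bandwidth assumed available for the migration
    survival_prob: float    # estimated probability the node is not reclaimed


def transfer_seconds(state_gb: float, bandwidth_gbps: float) -> float:
    """Time to move state_gb gigabytes over a link of bandwidth_gbps gigabits/s."""
    return state_gb * 8.0 / bandwidth_gbps


def pick_destination(
    candidates: Sequence[Candidate],
    state_gb: float,
    notice_s: float,
    hop_penalty: float = 0.05,  # assumed weight, not taken from the paper
) -> Optional[Candidate]:
    """Return the best feasible destination, or None if nothing fits the window."""
    # Notice-window feasibility: the transfer must finish before reclaim.
    feasible = [
        c for c in candidates
        if c.bandwidth_gbps > 0
        and transfer_seconds(state_gb, c.bandwidth_gbps) <= notice_s
    ]
    if not feasible:
        return None
    # Assumed scoring rule: prefer likely survivors, penalize distant nodes.
    return max(feasible, key=lambda c: c.survival_prob - hop_penalty * c.hops)


if __name__ == "__main__":
    nodes = [
        Candidate("lab-a", hops=1, bandwidth_gbps=10.0, survival_prob=0.60),
        Candidate("lab-b", hops=3, bandwidth_gbps=1.0, survival_prob=0.99),
        Candidate("lab-c", hops=2, bandwidth_gbps=25.0, survival_prob=0.90),
    ]
    best = pick_destination(nodes, state_gb=40.0, notice_s=60.0)
    print(best.name if best else "no feasible destination")
```

In this sketch, feasibility acts as a hard filter and the remaining factors are traded off in a single score. A node with a very high survival probability (`lab-b` above) is still rejected when its link cannot move the job state inside the notice window.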