GPU sharing is critical for maximizing hardware utilization in modern data centers. However, existing approaches present a stark trade-off: coarse-grained temporal multiplexing incurs severe tail-latency spikes for interactive services, while fine-grained spatial partitioning often requires invasive kernel modifications that compromise behavioral equivalence. We present DetShare, a novel GPU sharing system that prioritizes determinism and transparency. DetShare ensures semantic determinism (unmodified kernels yield identical results) and performance determinism (predictable tail latency), all while maintaining complete transparency (zero code modification). DetShare introduces GPU coroutines, a new abstraction that decouples logical execution contexts from physical GPU resources. This decoupling enables flexible, fine-grained resource allocation via lightweight context migration. Our evaluation demonstrates that DetShare improves training throughput by up to 79.2% compared to temporal multiplexing. In co-location scenarios, it outperforms state-of-the-art baselines, reducing P99 tail latency by 15.1% without compromising throughput. Furthermore, through workload-aware placement and our Time-Per-Output-Token (TPOT)-First scheduling policy, DetShare decreases average inference latency by 69.1% and reduces TPOT SLO violations by 21.2% relative to default policies.