Existing GPU spatial sharing systems face a three-way tradeoff among resource utilization, performance isolation, and semantic determinism. Hardware partitioning under-utilizes the GPU; hardware multiplexing fails to prevent performance interference; and recently proposed software-based GPU kernel slicing reorders floating-point reductions, destroying semantic determinism and inducing catastrophic token drift in generative models. We present CoGPU, a transparent spatial sharing system that resolves this trilemma. CoGPU introduces the \emph{GPU coroutine}, a novel abstraction that decouples logical resources from physical ones. By dynamically mapping immutable virtual contexts to mutable physical resources via lightweight cooperative migration, CoGPU enables extensible, workload-aware scheduling without altering kernel semantics. Evaluations demonstrate that CoGPU simultaneously achieves high utilization, strong isolation, and absolute semantic determinism (guaranteeing zero token mismatch). In multi-tenant co-location, it improves training throughput by up to 79.2\% over temporal sharing and reduces P99 inference tail latency by 15.1\%. Its pluggable architecture supports custom policies; compared to the default policy, a \textsc{TPOT-FIRST} policy further reduces SLO violations by 21.2\% under dynamic traffic.