Consumer machines increasingly run large ML workloads such as large language models (LLMs), text-to-image generation, and interactive image editing. Unlike datacenter GPUs, consumer GPUs serve single-user, rapidly changing workloads, and each model's working set often nearly fills GPU memory. As a result, existing sharing mechanisms (e.g., NVIDIA Unified Virtual Memory) perform poorly when multiple applications are active, suffering from memory thrashing and excessive use of CPU pinned memory. We design and implement Nixie, a system that enables efficient and transparent temporal multiplexing on consumer GPUs without requiring any application or driver changes. Nixie is a system service that coordinates GPU memory allocation and kernel launch behavior to make efficient use of the bidirectional CPU-GPU bandwidth and CPU pinned memory. A lightweight scheduler in Nixie further improves responsiveness by automatically prioritizing latency-sensitive interactive jobs using MLFQ-inspired techniques. Our evaluation shows that Nixie reduces the latency of real interactive code-completion tasks by up to $3.8\times$ and cuts CPU pinned-memory usage by up to 66.8% under the same latency requirement.