PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation

Checkpointing (C) and restoring (R) are key components for GPU tasks. POS is an OS-level GPU C/R system: it can transparently checkpoint or restore processes that use the GPU without requiring any cooperation from the application, a key feature required by modern systems such as the cloud. Moreover, POS is the first OS-level C/R system that can execute C/R concurrently with application execution, a critical feature that is trivially achieved when processes run only on the CPU but becomes challenging when they use the GPU. The problem is how to ensure consistency during concurrent execution when transparency leaves the system without application semantics. CPU processes can leverage the OS and hardware paging to fix inconsistency without application semantics. Unfortunately, GPUs bypass the OS and paging for high performance. POS fills the semantic gap by speculatively extracting the buffer access information of GPU kernels at runtime. Thanks to the simple and well-structured nature of GPU kernels, our speculative extraction (with runtime validation) achieves 100% accuracy on applications from training to inference, whose domains span vision, large language models, and reinforcement learning. Based on the extracted semantics, we systematically overlap C/R with application execution and achieve orders of magnitude higher performance than the state-of-the-art OS-level GPU C/R across various tasks, including training fault tolerance, live GPU process migration, and cold-start acceleration in GPU-based serverless computing.
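
To make the idea of overlapping a checkpoint with execution more concrete, the following is a minimal toy sketch in Python, not POS's actual implementation. It assumes that each kernel launch comes with a speculated set of buffers it may write; any such buffer that has not yet been saved is copied into the checkpoint image before the kernel runs (copy-on-write), and a naive before/after comparison stands in for runtime validation. All names (`ConcurrentCheckpoint`, `save_one`, `launch`, `speculated_writes`) are hypothetical, and plain `bytearray` objects stand in for GPU buffers.

```python
class ConcurrentCheckpoint:
    """Toy copy-on-write checkpoint over named buffers (all names hypothetical)."""

    def __init__(self, buffers):
        self.buffers = buffers        # name -> bytearray: live application state
        self.pending = set(buffers)   # buffers not yet copied into the image
        self.image = {}               # name -> bytes: the checkpoint image

    def _save(self, name):
        # Copy the buffer's current contents into the image and mark it done.
        self.image[name] = bytes(self.buffers[name])
        self.pending.discard(name)

    def save_one(self):
        """Background step: save one pending buffer while the app keeps running."""
        if self.pending:
            self._save(min(self.pending))

    def launch(self, kernel, args, speculated_writes):
        """Run a kernel while the checkpoint is still in progress."""
        # Copy-on-write: save pending buffers the kernel is speculated to write,
        # so the image keeps their contents from the checkpoint start.
        for name in speculated_writes:
            if name in self.pending:
                self._save(name)
        # Naive validation (assumption): remember the remaining pending buffers
        # and compare afterwards. A real system would need a far cheaper check.
        before = {name: bytes(self.buffers[name]) for name in self.pending}
        kernel(*[self.buffers[name] for name in args])
        wrong = [n for n, old in before.items() if bytes(self.buffers[n]) != old]
        if wrong:
            raise RuntimeError(f"misspeculated writes to {sorted(wrong)}")


def scale(src, dst):
    """Stand-in for a GPU kernel: reads src, writes dst."""
    for i, value in enumerate(src):
        dst[i] = (value * 2) % 256


buffers = {"a": bytearray([1, 2, 3]), "b": bytearray([0, 0, 0])}
ckpt = ConcurrentCheckpoint(buffers)
ckpt.launch(scale, ["a", "b"], speculated_writes={"b"})  # runs mid-checkpoint
while ckpt.pending:
    ckpt.save_one()                                      # finish in background
```

In this sketch the image ends up holding the contents of both buffers as they were when the checkpoint began, even though the kernel modified `b` before the background copy reached it. If the speculated write set misses a buffer that the kernel actually writes, `launch` raises instead of silently producing an inconsistent image.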
