Understanding LLM Checkpoint/Restore I/O Strategies and Patterns

From arXiv; SCA/HPCAsia 2026 Workshops: Supercomputing Asia and International Conference on High Performance Computing in the Asia Pacific Region Workshops

As LLMs and foundation models scale, checkpoint/restore has become a critical pattern for training and inference. With 3D parallelism (tensor, pipeline, data), checkpointing involves many processes, each managing numerous tensors of varying shapes and sizes, that must be persisted frequently to stable storage (e.g., parallel file systems). This turns checkpoint/restore into a big-data I/O problem characterized by volume, variety, and velocity. The workflow must traverse the full storage stack -- from GPU memory through host memory and local storage to external repositories -- whose tiers differ by orders of magnitude in performance, creating bottlenecks under concurrency even with asynchronous flush/prefetch. Kernel-accelerated I/O libraries such as \texttt{liburing} may mitigate these issues versus POSIX, but their effectiveness for LLM checkpointing remains underexplored. We develop microbenchmarks to quantify trade-offs when using \texttt{liburing}, evaluating how aggregation, alignment, and I/O coalescing interact under buffered and direct I/O. We find that uncoalesced small-buffer operations halve throughput relative to synthetic workloads, while file system-aware aggregation restores bandwidth and reduces metadata overhead. Compared to state-of-the-art LLM checkpointing engines, our approach achieves up to $3.9\times$ higher write throughput than DataStates-LLM and $7.6\times$ higher than TorchSnapshot. These results highlight the need for aggregation and coalescing strategies that align with modern file systems and I/O backends.
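To make the aggregation, alignment, and coalescing terms concrete, here is a minimal Python sketch of the general idea: many small buffers are packed at block-aligned offsets into one staging buffer and flushed as a single large write. It is not the authors' implementation and does not use \texttt{liburing}; it uses buffered POSIX `os.pwrite` as a stand-in. The 4096-byte alignment and the names `align_up`, `plan_layout`, and `write_coalesced` are assumptions made for illustration.

```python
import os
import tempfile

ALIGN = 4096  # assumed block size; direct I/O usually needs aligned offsets and lengths


def align_up(n, align=ALIGN):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def plan_layout(sizes, align=ALIGN):
    """Give each buffer an aligned offset inside one aggregated file."""
    offsets, cursor = [], 0
    for size in sizes:
        offsets.append(cursor)
        cursor = align_up(cursor + size, align)
    return offsets, cursor  # cursor is the padded total length


def write_coalesced(path, buffers, align=ALIGN):
    """Pack many small buffers into one staging buffer and write it in one pass."""
    offsets, total = plan_layout([len(b) for b in buffers], align)
    staging = bytearray(total)  # zero bytes fill the alignment gaps
    for off, buf in zip(offsets, buffers):
        staging[off:off + len(buf)] = buf

    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(staging)
        written = 0
        # pwrite may write fewer bytes than requested, so loop until done.
        while written < total:
            written += os.pwrite(fd, view[written:], written)
    finally:
        os.close(fd)
    return offsets, written


if __name__ == "__main__":
    # Hypothetical "tensors" of varying sizes, standing in for checkpoint shards.
    tensors = [b"a" * 100, b"b" * 5000, b"c"]
    with tempfile.TemporaryDirectory() as tmp:
        offsets, written = write_coalesced(os.path.join(tmp, "ckpt.bin"), tensors)
        print(offsets, written)
```

The sketch trades extra padding and one host-side copy for fewer, larger, aligned I/O requests and a single file instead of one file per buffer. Whether that trade pays off on a given file system is the kind of question the paper's microbenchmarks are designed to measure.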
