Systems based on non-volatile memory (NVM) are natural candidates for periodic checkpointing (or snapshotting). Checkpointing increases the reliability of the system, makes it more resilient to power failures, and reduces wasted work, especially in an HPC setup. The traditional approach is to design a system that is conceptually similar to transactional memory: updates are logged continuously to minimize the wasted work or, alternatively, the mean time to recovery (MTTR). Such ``instant recovery'' systems can recover from a point quite close to the point of failure. The penalty is a prohibitive number of additional writes to the NVM. In this paper, we propose a paradigmatically different approach. We argue that in most practical settings, such as regular HPC workloads or neural network training, there is no need for instant recovery. We can therefore afford to lose some work, take periodic software-initiated checkpoints, and still meet the goals of the application. The key benefit of our scheme is that it substantially reduces write amplification (WA), which extends the lifetime of NVMs by roughly the same factor. We go a step further and design an adaptive system that minimizes the WA for a given target checkpoint latency, and show that our control algorithm almost always performs near-optimally. Our scheme reduces the WA by 2.3--96\% compared to the nearest competing work.