Notebooks provide an author-friendly environment for iterative development, modular execution, and easy sharing. Distributed workflows are increasingly authored and executed in notebooks, yet sharing and reproducing them remains challenging. Even small code or parameter changes often force a full end-to-end re-execution of the distributed workflow, which limits iterative development for such workloads. Current methods for improving notebook execution target single-node workflows, while optimization techniques for distributed workflows typically sacrifice reproducibility. We introduce NBRewind, a notebook kernel system for efficient, reproducible execution of distributed workflows in notebooks. NBRewind consists of two kernels: audit and repeat. The audit kernel performs incremental, cell-level checkpointing to avoid unnecessary re-runs; the repeat kernel reconstructs checkpoints and enables partial re-execution, including of notebook cells that manage the distributed workflow. Both kernels rely on data-flow analysis across cells. We show how checkpoints and logs, when packaged as part of a standardized notebook specification, improve sharing and reproducibility. Using real-world case studies, we show that creating incremental checkpoints adds minimal overhead and enables portable, cross-site reproducibility of notebook-based distributed workflows on HPC systems.
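To make the idea of cross-cell data-flow analysis concrete, the following is a minimal sketch, not NBRewind's implementation. It assumes each cell is plain Python source, approximates a cell's reads and writes from its syntax tree, and reports which cells would need to be re-executed after one cell is edited; the remaining cells are the ones whose state could, in principle, be restored from a checkpoint. The function names `cell_reads_writes` and `stale_cells` are hypothetical, and the analysis deliberately ignores in-place mutation through method calls or attributes.

```python
import ast
import builtins

BUILTIN_NAMES = set(dir(builtins))


def cell_reads_writes(source):
    """Approximate the names a cell reads and the names it binds."""
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            else:
                writes.add(node.id)
        elif isinstance(node, ast.AugAssign) and isinstance(node.target, ast.Name):
            # `x += 1` both reads and writes x, but the AST marks x as a store only.
            reads.add(node.target.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            writes.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                writes.add((alias.asname or alias.name).split(".")[0])
    return reads - BUILTIN_NAMES, writes


def stale_cells(cells, changed):
    """Return indices of cells to re-run after cell `changed` is edited.

    Assumes cells execute top to bottom. A cell is stale if it is the edited
    cell or reads a name last written by a stale cell (conservative).
    """
    dirty = set()   # names whose current value comes from a stale cell
    stale = []
    for index, source in enumerate(cells):
        reads, writes = cell_reads_writes(source)
        if index == changed or reads & dirty:
            stale.append(index)
            dirty |= writes
        else:
            # A clean cell rebinds these names from still-valid inputs.
            dirty -= writes
    return stale


cells = [
    "data = list(range(10))",
    "scale = 3",
    "scaled = [scale * v for v in data]",
    "total = sum(data)",
    "report = max(scaled)",
]
print(stale_cells(cells, changed=1))  # [1, 2, 4]
```

In this example, editing the cell that defines `scale` marks only the cells that depend on it, directly or transitively, for re-execution; the cell computing `total` is untouched. A real system would also have to account for side effects, mutation, and state held by remote workers, which this syntactic sketch does not attempt.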