ElasticNotebook: Enabling Live Migration for Computational Notebooks

Computational notebooks (e.g., Jupyter, Google Colab) are widely used for interactive data science and machine learning. In these frameworks, users can start a session, then execute cells (i.e., sets of statements) to create variables, train models, visualize results, etc. Unfortunately, existing notebook systems do not offer live migration: when a notebook launches on a new machine, it loses its state, preventing users from continuing their tasks from where they left off. This is because, unlike in a DBMS, the sessions rely directly on underlying kernels (e.g., Python/R interpreters) without an additional data management layer. Existing techniques for preserving state, such as copying all variables or OS-level checkpointing, are unreliable (they often fail), inefficient, and platform-dependent. Re-running code from scratch can also be highly time-consuming. In this paper, we introduce a new notebook system, ElasticNotebook, that offers live migration via checkpointing/restoration using a novel mechanism that is reliable, efficient, and platform-independent. Specifically, by observing all cell executions via transparent, lightweight monitoring, ElasticNotebook can find a reliable and efficient way (i.e., a replication plan) to reconstruct the original session state, considering variable-cell dependencies, observed runtimes, variable sizes, etc. To this end, we formulate a new graph-based optimization problem that determines how to reconstruct all variables efficiently from a subset of variables that can be transferred across machines. We show that ElasticNotebook reduces end-to-end migration and restoration times by 85%-98% and 94%-99%, respectively, on a variety of notebooks (Kaggle, JWST, and Tutorial) with negligible runtime and memory overheads of <2.5% and <10%.
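
To make the idea of a replication plan concrete, the following is a minimal sketch under stated assumptions, not ElasticNotebook's actual algorithm. The cell history, variable sizes, bandwidth, the set of non-transferable variables, and the additive cost model are all hypothetical, and the exhaustive search over subsets merely stands in for the graph-based optimization described above. It chooses which variables to transfer and which cells to re-run so that every variable is reconstructed at the lowest estimated cost.

```python
from itertools import combinations

# Hypothetical observed history: cell -> (runtime in seconds, inputs, outputs).
CELLS = {
    "c1": (1.0, [], ["df"]),                # load data
    "c2": (60.0, ["df"], ["model"]),        # train model (slow)
    "c3": (0.5, ["df", "model"], ["plot"]), # visualize
}
SIZES_MB = {"df": 500.0, "model": 20.0, "plot": 1.0}  # hypothetical sizes
NON_TRANSFERABLE = {"plot"}   # assumed: cannot be copied across machines
BANDWIDTH_MB_PER_S = 100.0    # assumed network/storage bandwidth

def cells_to_rerun(stored):
    """Return the cells that must be re-run if only `stored` is transferred."""
    producer = {v: c for c, (_, _, outs) in CELLS.items() for v in outs}
    needed = [v for v in SIZES_MB if v not in stored]
    rerun = set()
    while needed:
        cell = producer[needed.pop()]
        if cell in rerun:
            continue
        rerun.add(cell)
        # Inputs that were not transferred must themselves be recomputed.
        needed.extend(v for v in CELLS[cell][1] if v not in stored)
    return sorted(rerun, key=list(CELLS).index)  # keep original cell order

def plan_cost(stored):
    """Estimated time: transfer stored variables plus re-run required cells."""
    transfer = sum(SIZES_MB[v] for v in stored) / BANDWIDTH_MB_PER_S
    recompute = sum(CELLS[c][0] for c in cells_to_rerun(stored))
    return transfer + recompute

def best_plan():
    """Brute-force search over all subsets of transferable variables."""
    candidates = [v for v in SIZES_MB if v not in NON_TRANSFERABLE]
    subsets = (set(s) for r in range(len(candidates) + 1)
               for s in combinations(candidates, r))
    stored = min(subsets, key=plan_cost)
    return stored, cells_to_rerun(stored), plan_cost(stored)

if __name__ == "__main__":
    stored, rerun, cost = best_plan()
    print("transfer:", stored, "re-run:", rerun, "estimated seconds:", cost)
```

In this toy history, the cheapest plan transfers only the small but expensive-to-train `model` and re-runs the two cheap cells that produce `df` and `plot`. This illustrates why neither copying all variables nor re-running everything is optimal. Exhaustive search is exponential in the number of variables, so it is suitable only for illustration.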
