ElasticNotebook: Enabling Live Migration for Computational Notebooks (Technical Report)

Computational notebooks (e.g., Jupyter, Google Colab) are widely used for interactive data science and machine learning. In these frameworks, users start a session and then execute cells (i.e., sets of statements) to create variables, train models, visualize results, etc. Unfortunately, existing notebook systems do not offer live migration: when a notebook launches on a new machine, it loses its state, preventing users from continuing their tasks where they left off. This is because, unlike in a DBMS, sessions rely directly on the underlying kernels (e.g., Python/R interpreters) without an additional data management layer. Existing techniques for preserving state, such as copying all variables or OS-level checkpointing, are unreliable (they often fail), inefficient, and platform-dependent, while re-running code from scratch can be highly time-consuming. In this paper, we introduce a new notebook system, ElasticNotebook, that offers live migration via checkpointing/restoration using a novel mechanism that is reliable, efficient, and platform-independent. Specifically, by observing all cell executions via transparent, lightweight monitoring, ElasticNotebook finds a reliable and efficient way (i.e., a replication plan) to reconstruct the original session state, considering variable-cell dependencies, observed runtimes, variable sizes, etc. To this end, we formulate a new graph-based optimization problem that determines how to efficiently reconstruct all variables from a subset of variables that can be transferred across machines. We show that ElasticNotebook reduces end-to-end migration and restoration times by 85%-98% and 94%-99%, respectively, on a variety of notebooks (Kaggle, JWST, and Tutorial), with negligible runtime and memory overheads of <2.5% and <10%, respectively.
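
To make the idea of a replication plan concrete, the following is a minimal toy sketch, not the optimization used by ElasticNotebook (the abstract does not specify it). It assumes that each variable is produced by exactly one cell, that dependencies are acyclic, and that the cost of a plan is the transfer time of the stored variables plus the runtime of the cells that must be re-run. All names (`Cell`, `cells_to_rerun`, `best_plan`, the example cells, sizes, and bandwidth) are hypothetical, and the brute-force search over subsets only stands in for a real graph-based solver.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Cell:
    name: str
    runtime_s: float    # observed execution time of the cell
    inputs: frozenset   # variables the cell reads
    outputs: frozenset  # variables the cell creates


def cells_to_rerun(stored, cells, needed):
    """Names of cells that must be re-run to rebuild `needed` when only
    the variables in `stored` are copied to the new machine."""
    # Simplifying assumption: every variable has exactly one producing cell.
    producer = {v: c for c in cells for v in c.outputs}
    rerun = set()
    stack = [v for v in needed if v not in stored]
    while stack:
        cell = producer[stack.pop()]
        if cell.name in rerun:
            continue
        rerun.add(cell.name)
        # Inputs that were not stored must themselves be recomputed.
        stack.extend(u for u in cell.inputs if u not in stored)
    return rerun


def best_plan(cells, sizes_mb, unserializable, bandwidth_mb_s):
    """Brute-force search over which serializable variables to store.
    Returns (cost in seconds, stored variables, cells to re-run)."""
    all_vars = set(sizes_mb)
    candidates = sorted(all_vars - unserializable)
    runtime = {c.name: c.runtime_s for c in cells}
    best = None
    for k in range(len(candidates) + 1):
        for subset in combinations(candidates, k):
            stored = set(subset)
            rerun = cells_to_rerun(stored, cells, all_vars)
            transfer_s = sum(sizes_mb[v] for v in stored) / bandwidth_mb_s
            recompute_s = sum(runtime[n] for n in rerun)
            cost = transfer_s + recompute_s
            if best is None or cost < best[0]:
                best = (cost, stored, rerun)
    return best


# Hypothetical session: load data, train a model, create a generator.
cells = [
    Cell("load", 2.0, frozenset(), frozenset({"df"})),
    Cell("train", 600.0, frozenset({"df"}), frozenset({"model"})),
    Cell("stream", 0.5, frozenset({"df"}), frozenset({"gen"})),
]
sizes_mb = {"df": 1000.0, "model": 50.0, "gen": 0.0}
unserializable = {"gen"}  # e.g., a generator that cannot be copied

cost, stored, rerun = best_plan(cells, sizes_mb, unserializable, bandwidth_mb_s=100.0)
print(cost, stored, sorted(rerun))
```

In this toy session, the cheapest plan copies only the small but expensive-to-compute `model` and re-runs the cheap `load` and `stream` cells, rather than copying the large `df` or retraining. The search is exponential in the number of variables, so it only illustrates the trade-off between variable size and recomputation time.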
