Computational notebooks (e.g., Jupyter, Google Colab) are widely used for interactive data science and machine learning. In these frameworks, users start a session and then execute cells (i.e., sets of statements) to create variables, train models, visualize results, etc. Unfortunately, existing notebook systems do not offer live migration: when a notebook launches on a new machine, it loses its state, preventing users from continuing their tasks from where they left off. This is because, unlike in a DBMS, sessions rely directly on the underlying kernels (e.g., Python/R interpreters) without an additional data management layer. Existing techniques for preserving state, such as copying all variables or OS-level checkpointing, are unreliable (they often fail), inefficient, and platform-dependent. Re-running code from scratch, meanwhile, can be highly time-consuming. In this paper, we introduce a new notebook system, ElasticNotebook, that offers live migration via checkpointing/restoration using a novel mechanism that is reliable, efficient, and platform-independent. Specifically, by observing all cell executions via transparent, lightweight monitoring, ElasticNotebook finds a reliable and efficient way (i.e., a replication plan) to reconstruct the original session state, considering variable-cell dependencies, observed runtimes, variable sizes, etc. To this end, we formulate a new graph-based optimization problem that determines how to reconstruct all variables efficiently from a subset of variables that can be transferred across machines. We show that ElasticNotebook reduces end-to-end migration and restoration times by 85%-98% and 94%-99%, respectively, on a variety of notebooks (Kaggle, JWST, and Tutorial), with negligible runtime and memory overheads of <2.5% and <10%, respectively.
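To make the idea of a replication plan concrete, the following is a minimal sketch, not ElasticNotebook's actual algorithm. It assumes a hypothetical three-cell session (the names `CELLS`, `SIZES`, `BANDWIDTH`, `cells_to_rerun`, and `best_plan` are all illustrative), assumes each variable is written by exactly one cell, and uses brute-force enumeration in place of the graph-based optimization described above. For every subset of serializable variables, it computes the cost of transferring that subset plus the cost of re-running the cells needed to rebuild everything else, and keeps the cheapest plan.

```python
from itertools import combinations

# Hypothetical toy session: each cell has an observed runtime (seconds),
# the variables it reads, and the variables it writes.
CELLS = {
    "c1": {"runtime": 2.0, "reads": set(), "writes": {"df"}},
    "c2": {"runtime": 60.0, "reads": {"df"}, "writes": {"model"}},
    "c3": {"runtime": 1.0, "reads": {"model", "df"}, "writes": {"plot"}},
}
# Variable sizes in MB; None marks a variable that cannot be serialized.
SIZES = {"df": 500.0, "model": 5.0, "plot": None}
BANDWIDTH = 100.0  # MB/s, an assumed transfer rate


def cells_to_rerun(migrated):
    """Return the cells that must be re-executed to rebuild every
    variable that is not in the migrated set."""
    # Toy assumption: each variable is written by exactly one cell.
    producer = {v: name for name, c in CELLS.items() for v in c["writes"]}
    rerun = set()
    pending = [v for v in SIZES if v not in migrated]
    while pending:
        cell = producer[pending.pop()]
        if cell in rerun:
            continue
        rerun.add(cell)
        # Inputs of a re-run cell must themselves be migrated or rebuilt.
        pending.extend(v for v in CELLS[cell]["reads"] if v not in migrated)
    return rerun


def best_plan():
    """Enumerate all subsets of serializable variables and return the
    cheapest (cost, migrated_variables, rerun_cells) triple."""
    storable = [v for v, size in SIZES.items() if size is not None]
    best = None
    for k in range(len(storable) + 1):
        for subset in combinations(storable, k):
            migrated = set(subset)
            rerun = cells_to_rerun(migrated)
            cost = (sum(SIZES[v] / BANDWIDTH for v in migrated)
                    + sum(CELLS[c]["runtime"] for c in rerun))
            if best is None or cost < best[0]:
                best = (cost, migrated, rerun)
    return best


if __name__ == "__main__":
    cost, migrated, rerun = best_plan()
    print(f"migrate {sorted(migrated)}, rerun {sorted(rerun)}, cost {cost:.2f}s")
```

In this toy session the cheapest plan transfers only the small `model` and re-runs the two fast cells, rather than copying the large `df` or re-running the slow training cell. Enumeration is exponential in the number of variables, so it only illustrates the trade-off between transfer cost and recomputation cost; it would not scale to real notebooks.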