Computational notebooks are notoriously prone to reproducibility failures. By permitting out-of-order cell execution, notebooks accumulate hidden state and implicit dependencies that cause interactive executions to silently diverge from clean top-to-bottom runs. Prior approaches either employ dependency analyses or enforce reactive dataflow models, and both face fundamental tradeoffs among expressiveness, precision, and performance. This paper exploits the insight that reproducibility can be enforced without precise dependency tracking: a notebook is reproducible if and only if executing its cells in top-to-bottom order from an empty store produces exactly the outputs currently recorded. We formalize this notion of reproducibility and present FlowBook, which implements a dynamic analysis that enforces reproducibility by tracking read and write sets at cell boundaries. FlowBook detects stale cells, whose recorded outputs may no longer reflect the current notebook state, and prevents operations that would violate reproducibility. FlowBook incurs near-imperceptible latency overhead (median: 70 ms).
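The reproducibility criterion stated above can be illustrated with a minimal sketch. It is not FlowBook's implementation: it re-executes every cell from scratch and compares outputs, whereas the abstract describes a dynamic analysis over read and write sets that avoids such a full re-run. The names `Cell`, `run_cell`, and `is_reproducible` are hypothetical, a "store" is modeled as a plain Python dictionary, and a cell's "output" is assumed to be only the text it prints.

```python
import contextlib
import io
from dataclasses import dataclass


@dataclass
class Cell:
    source: str         # the cell's code
    recorded: str = ""  # output currently recorded in the notebook (assumed: stdout only)


def run_cell(source: str, store: dict) -> str:
    """Execute one cell against `store` and return the text it printed."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(source, store)  # the cell reads and writes variables in `store`
    return buf.getvalue()


def is_reproducible(cells: list[Cell]) -> bool:
    """True iff a top-to-bottom run from an empty store yields exactly
    the recorded output of every cell (stops at the first mismatch)."""
    store: dict = {}  # empty store: no hidden state from earlier sessions
    return all(run_cell(c.source, store) == c.recorded for c in cells)
```

For example, the notebook `[Cell("x = 1"), Cell("print(x)", "1\n")]` passes this check. If the first cell is edited to `x = 2` while the second cell keeps its recorded output `"1\n"`, the check fails: the second cell is stale in the sense used above, because a clean run would now print `2`.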