This paper is a theoretical follow-up to the 2019 DBLog paper, which introduced a change-data-capture (CDC) mechanism for both backfilling a downstream system from a live source database and streaming its ongoing changes, while the source keeps accepting writes. DBLog reads the table in primary-key range scans (chunks) interleaved with the source change log: watermarks locate each chunk in log order, chunk rows become refresh events, and CDC events repair stale chunk observations. The mechanism requires no table lock, no pause in writes, and no global read transaction, and has since been adopted by Debezium and Apache Flink CDC. The 2019 paper described the mechanism operationally but did not formalize its correctness object. This paper formalizes that object: DBLog constructs a snapshot-equivalent replay certificate without a single physical snapshot read. The central formal object is a certified virtual cut: a finite evidence bundle whose certified replay reaches the same per-key state as the source at a chosen frontier on a chosen key scope. A virtual cut is extensional: replay equality at a frontier, not a physical snapshot read, and does not assert one source timestamp across chunk rows. The paper proves per-key replay equality for every wellformed DBLog run at its frontier and scope, and that an accepted certificate, evaluated against faithful source observation, witnesses such a run and yields a virtual cut. It also proves a source-side continuation result: on the same scope, an established cut advances to later frontiers by appending the scope-filtered faithful CDC segment committed in between. Each result is Conditional on premises stated where the result appears. Whole-table correctness, exactly-once delivery, sink-state convergence, and transfer to named deployments are not consequences of certificate acceptance alone. All proofs are machine-checked in Isabelle/HOL.
翻译:本文是2019年DBLog论文的理论续篇,该论文提出了一种变更数据捕获(CDC)机制,用于在源数据库持续接受写入的同时,既从活跃源数据库回填下游系统,又持续传输其增量变更。DBLog采用主键范围扫描(分块)方式读取数据表,并与源变更日志交错执行:水印标记每个分块在日志中的位置,分块行构成刷新事件,CDC事件则修正陈旧的分块观测结果。该机制无需表锁、无需暂停写入事务,也无需全局读取事务,此后已被Debezium和Apache Flink CDC采纳。2019年论文从操作层面描述了该机制,但未形式化其正确性目标。本文将该目标形式化:DBLog在没有单一物理快照读取的条件下构建了快照等价的回放证书。核心形式化对象是可认证虚拟切片:一组有限的证据束,其经认证的回放能在选定键域上达到与源数据库在选定时间前沿相同的逐键状态。虚拟切片具有外延性——它是在时间前沿上的回放等价性而非物理快照读取,且不要求分块行对应同一源时间戳。本文证明:每个格式正确的DBLog运行在其时间前沿和键域上满足逐键回放等价性;一个经接受的证书,若与忠实源观测对照验证,则能证成此类运行并产生虚拟切片。本文还证明了源端连续性结论:在相同键域上,通过在已建立切片的后续提交数据上附加经过键域筛选的忠实CDC段,可将该切片推进至更晚时间前沿。每项结论均以陈述处明示的前提假设为条件。全表正确性、精确一次交付、接收端状态收敛及向命名部署的迁移,并非仅凭证书接受即可保证的推论。所有证明均在Isabelle/HOL中完成机器验证。