Lakehouses are the default cloud platform for analytics and AI, but they become unsafe when untrusted actors concurrently operate on production data: upstream-downstream mismatches surface only at runtime, and multi-table pipelines can leak partial effects. Inspired by software engineering, we design Bauplan, a code-first lakehouse that aims to make (most) illegal states unrepresentable using familiar abstractions. Bauplan acts along three axes: typed table contracts to make pipeline boundaries checkable, Git-like data versioning for review and reproducibility, and transactional runs that guarantee pipeline-level atomicity. We report early results from a lightweight formal transaction model and discuss future work motivated by counterexamples.
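To make two of the three axes concrete, the following is a minimal sketch, not Bauplan's actual API. It assumes a toy in-memory catalog in which table snapshots are immutable tuples; the names `contract_violations`, `Catalog`, and `run` are hypothetical and chosen only for illustration. The first function shows how a typed table contract lets a pipeline boundary be checked before anything runs; the class shows how a run can write to a private branch and publish all of its tables in one step, so a failure leaves no partial effects on `main`.

```python
from typing import Callable

# Hypothetical toy types: a schema maps column names to type names,
# and a catalog state maps table names to immutable row snapshots.
Schema = dict[str, str]
Tables = dict[str, tuple]


def contract_violations(produced: Schema, expected: Schema) -> list[str]:
    """List every way the upstream schema fails the downstream contract."""
    errors = []
    for column, type_name in expected.items():
        if column not in produced:
            errors.append(f"missing column {column!r}")
        elif produced[column] != type_name:
            errors.append(
                f"column {column!r}: expected {type_name}, got {produced[column]}"
            )
    return errors


class Catalog:
    """Toy catalog with a single production branch, `main` (an assumption)."""

    def __init__(self) -> None:
        self.main: Tables = {}

    def run(self, steps: list[tuple[str, Callable[[Tables], tuple]]]) -> bool:
        # Work on a private branch; the shallow copy is safe because
        # snapshots are immutable tuples.
        branch = dict(self.main)
        try:
            for table, step in steps:
                branch[table] = step(branch)
        except Exception:
            # Any failing step aborts the run; `main` was never touched.
            return False
        # Publish every table at once by swapping a single reference.
        self.main = branch
        return True


catalog = Catalog()

# Second step fails, so the first step's table must not appear on main.
failed = catalog.run([
    ("orders", lambda tables: ((1, 9.5), (2, 3.0))),
    ("totals", lambda tables: 1 / 0),
])

# Both steps succeed, so both tables become visible together.
succeeded = catalog.run([
    ("orders", lambda tables: ((1, 9.5), (2, 3.0))),
    ("totals", lambda tables: (sum(row[1] for row in tables["orders"]),)),
])
```

In this sketch, atomicity comes from never mutating `main` in place: readers see either the state before the run or the state after it. A real lakehouse would have to provide the same property over object storage and under concurrent writers, which this toy does not model.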