Lakehouses are the default cloud platform for analytics and AI, but they become unsafe when untrusted actors concurrently operate on production data: upstream-downstream mismatches surface only at runtime, and multi-table pipelines can leak partial effects. Inspired by software engineering, we design Bauplan, a code-first lakehouse that aims to make (most) illegal states unrepresentable using familiar abstractions. Bauplan acts along three axes: typed table contracts to make pipeline boundaries checkable, Git-like data versioning for review and reproducibility, and transactional runs that guarantee pipeline-level atomicity. We report early results from a lightweight formal transaction model and discuss future work motivated by counterexamples.
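To make the first and third axes concrete, the following is a minimal sketch in plain Python, not Bauplan's actual API. The names `Contract`, `Catalog`, `satisfied_by`, and `run` are hypothetical, and the staging-then-publish mechanism is an assumption chosen for illustration. It shows a contract check at a pipeline boundary and a multi-table run whose outputs become visible either all together or not at all.

```python
from dataclasses import dataclass


@dataclass
class Contract:
    """Hypothetical table contract: column name -> type name."""
    columns: dict

    def satisfied_by(self, upstream):
        # The boundary is valid if every expected column exists
        # upstream with exactly the expected type.
        return all(upstream.columns.get(name) == typ
                   for name, typ in self.columns.items())


class Catalog:
    """Toy catalog: a branch maps table names to (immutable) snapshots."""

    def __init__(self):
        self.branches = {"main": {}}

    def run(self, steps, target="main"):
        # Write into a private staging copy so that a failing step
        # leaves no partial effects on the target branch.
        staging = dict(self.branches[target])
        for table, fn in steps:
            staging[table] = fn(staging)  # may raise
        # Publish every table at once by swapping the branch pointer.
        self.branches[target] = staging


def failing_step(tables):
    raise RuntimeError("step failed")


# Contract check: a type mismatch is caught before anything runs.
produced = Contract({"id": "int", "amount": "float"})
expected = Contract({"id": "int", "amount": "str"})
mismatch_detected = not expected.satisfied_by(produced)

# Transactional run: the second pipeline fails halfway through,
# so neither "clean" nor "report" appears on main.
catalog = Catalog()
catalog.run([("raw", lambda t: [1, 2, 3])])
try:
    catalog.run([("clean", lambda t: [x * 2 for x in t["raw"]]),
                 ("report", failing_step)])
except RuntimeError:
    pass
```

In this sketch atomicity comes from building all outputs off to the side and publishing them with a single assignment; a real lakehouse would need a durable, concurrency-safe equivalent of that pointer swap.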