Lakehouses are now the default substrate for analytics and AI, but they remain fragile under concurrent, untrusted change: schema mismatches often surface only at runtime, development and production easily diverge, and multi-table pipelines can expose partial results after failure. We present Bauplan, a code-first lakehouse that aims to eliminate a broad class of these failures by construction. Bauplan builds on a storage substrate that already provides atomic single-table snapshot evolution, and adds three pipeline-level correctness mechanisms: typed table contracts to make transformation boundaries checkable, Git-like data versioning to support reproducible collaboration and review, and transactional runs that guarantee atomic publication of an entire pipeline execution. We describe the system design, show how these abstractions fit together into a unified programming model for humans and agents, and report early results from a lightweight Alloy model that both validates key intuitions and exposes subtle counterexamples around transactional branch visibility. Our experience suggests that correctness in the lakehouse is best addressed not by patching failures after the fact, but by restricting the programming model so that many illegal states become unrepresentable.
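To make the combination of typed table contracts and transactional runs concrete, the following is a minimal in-memory sketch, not Bauplan's actual API. Every name in it (`Catalog`, `run`, `check_contract`, `ContractError`, and the `Step` tuple) is a hypothetical introduced for illustration. It assumes a pipeline is an ordered list of steps, each declaring an output table, a column-to-type contract, and a transform over the tables staged so far. Outputs are staged on a private copy of the target branch and become visible through a single reference swap only if every step satisfies its contract.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Schema = Dict[str, type]   # column name -> expected Python type (hypothetical contract form)
Rows = List[dict]
# A step is (output table name, output contract, transform over the staged tables).
Step = Tuple[str, Schema, Callable[[Dict[str, Rows]], Rows]]


class ContractError(Exception):
    """Raised when a step's output violates its declared contract."""


def check_contract(table: str, rows: Rows, schema: Schema) -> None:
    """Check that every row has exactly the declared columns with the declared types."""
    for row in rows:
        if set(row) != set(schema):
            raise ContractError(f"{table}: columns {sorted(row)} != {sorted(schema)}")
        for col, typ in schema.items():
            if not isinstance(row[col], typ):
                raise ContractError(f"{table}.{col}: expected {typ.__name__}")


@dataclass
class Catalog:
    # branch name -> {table name -> tuple of rows (one snapshot per table)}
    branches: Dict[str, Dict[str, tuple]] = field(
        default_factory=lambda: {"main": {}}
    )

    def run(self, steps: List[Step], target: str = "main") -> None:
        """Run all steps against a staged copy; publish all outputs or none."""
        staged = dict(self.branches[target])  # private working copy of the branch
        for table, schema, transform in steps:
            rows = transform({name: list(snap) for name, snap in staged.items()})
            check_contract(table, rows, schema)  # any failure aborts before publication
            staged[table] = tuple(rows)
        # Single reference swap: readers of `target` see the old state or the new one.
        self.branches[target] = staged
```

In this sketch, a run whose second step violates its contract raises `ContractError` and leaves the target branch unchanged, so the first step's output is never exposed on its own. The sketch deliberately ignores concurrent writers to the target branch and says nothing about when intermediate state on a working branch should be visible to other readers.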