Data transformation correctness is a major challenge in data engineering: how to verify pipeline accuracy before deployment. Traditional methods involve costly iterative testing, data materialization, and manual error detection, due to the lack of formal approaches to reasoning about data granularity (grain), which can shift during transformations, causing issues like fan traps (metrics duplication) and chasm traps (data loss). We introduce the first formal, mathematical definition of grain, extending it from an informal concept in dimensional modeling to a universal, type-theoretic framework applicable to any data type. Encoding grain into the type system allows compile-time verification of transformation correctness, shifting validation from runtime. We define three core grain relations-equality, ordering, and incomparability-and prove a general grain inference theorem that computes the output grain of equi-joins from input grains using type-level operations. This covers all join scenarios, including comparable and incomparable keys. Together with inference rules for relational operations, this enables verification through schema analysis alone, at zero cost. Our approach allows engineers to verify that entire pipeline DAGs maintain correctness properties, detecting grain-related errors such as fan traps, chasm traps, and aggregation issues before data processing. It emphasizes the importance of grain, focusing on critical characteristics rather than all data details. We provide machine-checked formal proofs in Lean 4, reducing verification costs by 98-99%. Additionally, large language models can automatically generate correctness proofs, shifting human effort from proof writing to proof verification, thus democratizing formal methods in data engineering and supporting confident deployment of AI-generated pipelines with machine-checkable guarantees.
翻译:数据转换的正确性是数据工程中的一个主要挑战:如何在部署前验证管道的准确性。传统方法由于缺乏对数据粒度(grain)进行形式化推理的手段,需要进行昂贵的迭代测试、数据物化和手动错误检测。数据粒度在转换过程中可能发生变化,导致诸如扇形陷阱(指标重复)和深渊陷阱(数据丢失)等问题。我们首次提出了粒度的形式化数学定义,将其从维度建模中的非正式概念扩展为适用于任何数据类型的通用类型论框架。将粒度编码到类型系统中,可以在编译时验证转换的正确性,将验证从运行时提前。我们定义了三种核心粒度关系——相等、有序和不可比——并证明了一个通用的粒度推断定理,该定理使用类型级操作从输入粒度计算等值连接的输出粒度。这涵盖了所有连接场景,包括可比和不可比的键。结合关系操作的推断规则,仅通过模式分析即可实现零成本的验证。我们的方法使工程师能够验证整个管道有向无环图是否保持正确性属性,在数据处理前检测与粒度相关的错误,如扇形陷阱、深渊陷阱和聚合问题。该方法强调粒度的重要性,关注关键特征而非所有数据细节。我们在Lean 4中提供了机器可检查的形式化证明,将验证成本降低了98-99%。此外,大型语言模型可以自动生成正确性证明,将人类工作从证明编写转向证明验证,从而在数据工程中普及形式化方法,并支持以机器可检查的保证自信地部署AI生成的管道。