Data-driven science relies on complex data engineering pipelines that clean, transform, and alter data in preparation for machine learning. Robust results can only be achieved when each step in the pipeline can be justified and its effect on the data explained. In this setting, our aim is to give data scientists the means to understand in depth how each step in the pipeline affects the data, from the raw input to training sets ready to be used for learning. Starting from an extensible set of data preparation operators commonly used in data science, we present a provenance management infrastructure for generating, storing, and querying very granular accounts of data transformations, at the level of individual elements within datasets whenever possible. From the formal definition of a core set of data science preprocessing operators, we then derive a provenance semantics embodied by a collection of templates expressed in PROV, a standard model for data provenance. Using those templates as a reference, our provenance generation algorithm generalises to any operator with observable input/output pairs. We provide a prototype implementation of an application-level provenance capture library that produces, in a semi-automatic way, complete provenance documents accounting for the entire pipeline. We report on the ability of our implementation to capture provenance in real ML benchmark pipelines and over TPC-DI synthetic data. Finally, we show how the collected provenance can be used to answer a suite of provenance benchmark queries that underpin some common pipeline inspection questions, as expressed on the Data Science Stack Exchange.
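To make the idea of deriving provenance from observable input/output pairs more concrete, the following is a minimal sketch and not the algorithm or library described in this work. It assumes that a dataset is a mapping from stable row identifiers to column values, and that an operator can be observed only through its input and output. The names `diff_provenance` and `cell_id`, and the identifier scheme `v<version>/r<row>/<column>`, are hypothetical. Only the relation names (`wasGeneratedBy`, `wasDerivedFrom`, `wasInvalidatedBy`) come from PROV.

```python
from typing import Any, Dict, List, Tuple

Table = Dict[int, Dict[str, Any]]  # row id -> {column name -> value}
Record = Tuple[str, str, str]      # (PROV relation, subject, object)


def cell_id(version: int, row: int, col: str) -> str:
    """Hypothetical identifier for one cell in one version of the dataset."""
    return f"v{version}/r{row}/{col}"


def diff_provenance(before: Table, after: Table,
                    activity: str, version: int) -> List[Record]:
    """Compare an operator's input and output cell by cell.

    Assumes row identifiers are preserved by the operator; operators
    that re-key rows (joins, for example) would need extra bookkeeping.
    """
    records: List[Record] = []
    # Cells present in the output: either new, changed, or unchanged.
    for row, cells in after.items():
        old_cells = before.get(row, {})
        for col, value in cells.items():
            new = cell_id(version + 1, row, col)
            if col not in old_cells:
                # Cell did not exist in the input: created by the operator.
                records.append(("wasGeneratedBy", new, activity))
            elif old_cells[col] != value:
                # Cell value changed: new entity derived from the old one.
                records.append(("wasGeneratedBy", new, activity))
                records.append(("wasDerivedFrom", new, cell_id(version, row, col)))
            # Unchanged cells produce no record in this sketch.
    # Cells present in the input but missing from the output.
    for row, cells in before.items():
        for col in cells:
            if row not in after or col not in after[row]:
                records.append(("wasInvalidatedBy", cell_id(version, row, col), activity))
    return records


# Example: an imputation step fills a missing age in row 1.
before = {0: {"age": 31, "city": "Rome"}, 1: {"age": None, "city": "Oslo"}}
after = {0: {"age": 31, "city": "Rome"}, 1: {"age": 31, "city": "Oslo"}}
records = diff_provenance(before, after, "impute_age", version=0)
for relation, subject, obj in records:
    print(f"{relation}({subject}, {obj})")
```

Recording only the cells that change keeps the output small, at the cost of leaving unchanged cells implicit. This is a design choice of the sketch and is not a claim about how the templates in this work treat them.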