Tracking data lineage is important for data integrity, reproducibility, and debugging data science workflows. However, fine-grained lineage (i.e., at the cell level) is challenging to store, even for the smallest datasets. This paper introduces DSLog, a storage system that efficiently stores, indexes, and queries array data lineage, agnostic to the capture methodology. A main contribution is our new compression algorithm, named ProvRC, that compresses captured lineage relationships. Using ProvRC for lineage compression results in a significant storage reduction over functions with simple spatial regularity, outperforming alternative columnar-store baselines by up to 2000x. We also show that ProvRC facilitates in-situ query processing that allows forward and backward lineage queries without decompression; in the best case, it surpasses baselines by 20x in query latency on random NumPy pipelines.
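To make the storage problem concrete, the following is a minimal illustrative sketch, not the ProvRC algorithm itself. It assumes a one-dimensional array and a one-to-one lineage pattern (each output cell depends on a single input cell), and it uses a simple run-merging scheme chosen only for illustration. The function names `cell_lineage_roll`, `compress_runs`, `backward`, and `forward` are hypothetical and do not come from DSLog. The sketch shows how spatially regular cell-level lineage can collapse into a few ranges, and how forward and backward queries can be answered on the compressed form without expanding it.

```python
def cell_lineage_roll(n, shift):
    """Cell-level lineage of a circular shift on a length-n array:
    output cell i depends on input cell (i - shift) % n."""
    return [(i, (i - shift) % n) for i in range(n)]


def compress_runs(pairs):
    """Merge consecutive (out, in) pairs that advance by one in both
    coordinates into (out_start, in_start, length) runs.
    Illustrative only: handles one-to-one patterns, not general lineage."""
    runs = []
    for out_idx, in_idx in sorted(pairs):
        if runs:
            o, i, length = runs[-1]
            if out_idx == o + length and in_idx == i + length:
                runs[-1] = (o, i, length + 1)
                continue
        runs.append((out_idx, in_idx, 1))
    return runs


def backward(runs, out_idx):
    """Input cells that a given output cell depends on, read directly
    from the runs without expanding them."""
    return [i + (out_idx - o) for o, i, length in runs
            if o <= out_idx < o + length]


def forward(runs, in_idx):
    """Output cells that depend on a given input cell, read directly
    from the runs without expanding them."""
    return [o + (in_idx - i) for o, i, length in runs
            if i <= in_idx < i + length]


# An elementwise operation (shift of 0) over 1000 cells yields 1000
# cell-level pairs but a single run.
pairs = cell_lineage_roll(1000, 0)
runs = compress_runs(pairs)
print(len(pairs), len(runs))
```

In this toy setting the number of stored records depends on how regular the access pattern is rather than on the array size, which is the intuition behind compressing lineage for functions with simple spatial regularity.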