Data provenance, the process of determining the origin and derivation of data outputs, has applications in many domains, including explaining database query results and auditing scientific workflows. Despite decades of research, provenance tracing remains challenging because of its high computational cost and storage requirements. In streaming systems such as Apache Flink, fine-grained provenance graphs can grow super-linearly with data volume, which poses significant scalability challenges. We define temporal attribution, a new lightweight form of provenance suited to tasks such as quantitatively monitoring dependencies between system components over time. Temporal attribution enables time-focused analysis that does not require fine-grained, tuple-level dependency metadata. Inspired by volume-based provenance tracking in Temporal Interaction Networks (TINs), we demonstrate that TINs can succinctly model quantified data exchanges over time between dataflow operators in stream processing systems and, more generally, in data processing workflows. We classify data into discrete and liquid types, define five types of temporal provenance queries, and propose a state-based indexing approach. Our vision outlines research directions toward making temporal attribution a practical tool for large-scale dataflow analytics.
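To make the idea of volume-based tracking over quantified exchanges more concrete, the following is a minimal sketch, not the indexing approach proposed here. It assumes a proportional selection policy: when an operator forwards a quantity, the transfer draws from each origin in proportion to what the operator currently buffers, and any shortfall is treated as data newly generated at that operator. The function name `track_origins`, the tuple layout `(time, src, dst, quantity)`, and the operator labels are all hypothetical.

```python
from collections import defaultdict


def track_origins(interactions):
    """Propagate per-origin quantities through a temporal interaction network.

    interactions: iterable of (time, src, dst, quantity) tuples (assumed layout).
    Returns a mapping vertex -> {origin: buffered quantity}.
    """
    # buffers[v][o] = quantity currently held at vertex v that originated at o
    buffers = defaultdict(lambda: defaultdict(float))
    for _, src, dst, qty in sorted(interactions):  # process in time order
        held = buffers[src]
        total = sum(held.values())
        moved = min(qty, total)
        if total > 0:
            frac = moved / total
            # Proportional policy (an assumption): take the same fraction
            # from every origin currently buffered at src.
            for origin in list(held):
                share = held[origin] * frac
                held[origin] -= share
                buffers[dst][origin] += share
        # Anything src sends beyond what it holds counts as generated at src.
        generated = qty - moved
        if generated > 0:
            buffers[dst][src] += generated
    return buffers


# Hypothetical three-step exchange between operators A, B, C, and D.
example = [(1, "A", "C", 6), (2, "B", "C", 2), (3, "C", "D", 4)]
result = track_origins(example)
print(dict(result["D"]))
```

In this example, operator D ends up holding quantities attributed to A and B in the same 3:1 ratio as the buffer at C, without any tuple-level dependency records being stored. Other selection policies (for instance FIFO) would change only the loop that decides which buffered quantities are moved.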