Data provenance, the process of determining the origin and derivation of data outputs, has applications across many domains, including explaining database query results and auditing scientific workflows. Despite decades of research, provenance tracing remains challenging because of its computational cost and storage overhead. In streaming systems such as Apache Flink, provenance graphs can grow super-linearly with data volume, which poses significant scalability challenges. Temporal provenance is a promising direction: by attaching timestamps to provenance information, it enables time-focused queries without maintaining complete historical records. However, existing temporal provenance methods focus primarily on system-level debugging, leaving a gap in data management applications. This paper proposes a research agenda that uses Temporal Interaction Networks (TINs) to represent temporal provenance efficiently. We demonstrate the applicability of TINs across streaming systems, transportation networks, and financial networks. We classify data into discrete and liquid types, define five temporal provenance query types, and propose a state-based indexing approach. Our vision outlines research directions toward making temporal provenance a practical tool for large-scale dataflows.