Charts, figures, and text derived from data play an important role in decision making, from data-driven policy development to day-to-day choices informed by online articles. Making sense of such outputs, or fact-checking them, means understanding how they relate to the underlying data. Even for domain experts with access to the source code and data sets, this poses a significant challenge. In this paper we introduce a new program analysis framework that uses dynamic dependence graphs to support interactive exploration of fine-grained I/O relationships directly through computed outputs. Our main contribution is a novel notion in data provenance which we call related inputs: a relation of mutual relevance, or "cognacy", which arises between inputs when they contribute to common features of the output. Queries of this form allow readers to ask questions like "What outputs use this data element, and what other data elements are used along with it?" We show how Jonsson and Tarski's concept of conjugate operators on Boolean algebras appropriately characterises the notion of cognacy in a dependence graph, and give a procedure for computing related inputs over such a graph.
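To make the related-inputs query concrete, the following is a minimal Python sketch, not the procedure developed in the paper. It assumes the dependence graph is given as a plain list of directed edges from each value to the values computed from it; the names `reachable`, `related_inputs`, and the toy graph are hypothetical and chosen only for illustration. The query is answered in two passes: a forward pass from the selected inputs to the outputs that use them, then a backward pass from those outputs to every input they depend on. These two passes are the image and preimage of the same reachability relation, which is the standard example of a conjugate pair of operators on a powerset Boolean algebra.

```python
from collections import defaultdict


def reachable(adj, start):
    """Return every node reachable from `start` via `adj`, including `start` itself."""
    seen = set(start)
    stack = list(start)
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def related_inputs(edges, inputs, outputs, selected):
    """Answer: which outputs use `selected`, and which inputs are used along with it?

    `edges` is an iterable of (source, target) dependence edges (an assumed
    representation). Returns a pair (outputs using the selection, related inputs).
    """
    fwd = defaultdict(set)  # value -> values that depend on it
    bwd = defaultdict(set)  # value -> values it depends on
    for src, dst in edges:
        fwd[src].add(dst)
        bwd[dst].add(src)
    # Forward pass: outputs that depend on the selected inputs.
    used_by = reachable(fwd, selected) & set(outputs)
    # Backward pass: all inputs those outputs depend on.
    related = reachable(bwd, used_by) & set(inputs)
    return used_by, related


# Toy graph: inputs a and b feed an intermediate sum s shown as bar1; c feeds bar2 alone.
edges = [("a", "s"), ("b", "s"), ("s", "bar1"), ("c", "bar2")]
inputs = {"a", "b", "c"}
outputs = {"bar1", "bar2"}

used_by, related = related_inputs(edges, inputs, outputs, {"a"})
print(sorted(used_by))  # ['bar1']
print(sorted(related))  # ['a', 'b']
```

In this toy graph, selecting `a` reports `bar1` as the output that uses it and `b` as the other input used along with it, while `c` is unrelated because it shares no output with `a`.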