Charts, figures, and text derived from data play an important role in decision making, from data-driven policy development to day-to-day choices informed by online articles. Making sense of, or fact-checking, outputs means understanding how they relate to the underlying data. Even for domain experts with access to the source code and data sets, this poses a significant challenge. In this paper we introduce a new program analysis framework which supports interactive exploration of fine-grained I/O relationships directly through computed outputs, making use of dynamic dependence graphs. Our main contribution is a novel notion in data provenance which we call related inputs, a relation of mutual relevance or "cognacy" which arises between inputs when they contribute to common features of the output. Queries of this form allow readers to ask questions like "What outputs use this data element, and what other data elements are used along with it?". We show how Jonsson and Tarski's concept of conjugate operators on Boolean algebras appropriately characterises the notion of cognacy in a dependence graph, and give a procedure for computing related inputs over such a graph.
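To make the related-inputs query concrete, the following is a minimal sketch, not the paper's procedure. It assumes the dependence graph is a plain set of directed edges, and that "contributes to" can be approximated by graph reachability. The names `reachable`, `related_inputs`, and the toy graph are hypothetical, introduced only for illustration. The query runs in two steps: a forward pass from the selected inputs to the outputs that use them, then a backward pass from those outputs to every input they use. The forward and backward image maps of a single relation are conjugate in the Jonsson and Tarski sense, which is the pairing the abstract refers to.

```python
from collections import defaultdict


def reachable(edges, starts):
    """Return every node reachable from `starts` (inclusive) by following `edges`."""
    seen = set(starts)
    stack = list(starts)
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def related_inputs(deps, inputs, outputs, selected):
    """Sketch of a related-inputs query over a dependence graph.

    deps     -- iterable of (source, target) pairs: target depends on source
    inputs   -- set of input nodes
    outputs  -- set of output nodes
    selected -- the inputs the reader is asking about
    """
    forward = defaultdict(set)   # source -> nodes that depend on it
    backward = defaultdict(set)  # target -> nodes it depends on
    for src, dst in deps:
        forward[src].add(dst)
        backward[dst].add(src)

    # Step 1: which outputs use the selected inputs?
    used_by = reachable(forward, selected) & outputs
    # Step 2: which inputs do those outputs use (including the selected ones)?
    related = reachable(backward, used_by) & inputs
    return used_by, related


# Hypothetical toy graph: bar1 is computed from a + b, bar2 from c alone.
deps = [("a", "sum"), ("b", "sum"), ("sum", "bar1"), ("c", "bar2")]
inputs = {"a", "b", "c"}
outputs = {"bar1", "bar2"}

used_by, related = related_inputs(deps, inputs, outputs, {"a"})
print(sorted(used_by), sorted(related))
```

In this toy graph, selecting input `a` reports output `bar1` and the related inputs `a` and `b`; `c` is excluded because it shares no output with `a`. A real implementation would work over a dynamic dependence graph recorded during execution rather than a hand-written edge list.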