Can we infer the sources of errors from the outputs of complex data analytics software? Bidirectional programming promises that we can reverse the flow of software and translate corrections of the output into corrections of either the input or the data analysis. This allows us to achieve the holy grail of automated approaches to debugging, risk reporting, and large-scale distributed error tracking. Since the processing of risk reports and data analysis pipelines can frequently be expressed as a sequence of relational algebra operations, we propose replacing this traditional approach with a data summarization algebra that helps determine the impact of errors. It works by defining data analysis as a necessarily complete summarization of a dataset, possibly in multiple ways along multiple dimensions. We also present a description that communicates how complete summarizations of the input data may facilitate easier debugging and more efficient development of analysis pipelines. This approach can also be described as a generalization of axiomatic theories of accounting to data analytics, hence the name data accounting. We also propose formal properties that allow transparent assertions about the impact of individual records on the aggregated data, and that ease debugging by making it possible to find, on a per-record basis, minimal changes that alter the behaviour of the data analysis.
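To make the idea of a complete summarization concrete, the following is a minimal Python sketch and not the algebra proposed in the paper. It assumes that "complete" means every record falls into exactly one bucket along a chosen dimension, so that the bucket totals add up to the grand total, and that the impact of a record is the change in those totals when the record is removed. The record fields (`desk`, `region`, `amount`) and the helpers `summarize`, `is_complete`, and `record_impact` are hypothetical names introduced only for illustration.

```python
from collections import defaultdict


def summarize(records, key):
    """Sum amounts per bucket; every record lands in exactly one bucket."""
    totals = defaultdict(float)
    for rec in records:
        totals[key(rec)] += rec["amount"]
    return dict(totals)


def is_complete(records, summary):
    """Assumed completeness check: bucket totals add up to the grand total."""
    grand_total = sum(rec["amount"] for rec in records)
    return abs(sum(summary.values()) - grand_total) < 1e-9


def record_impact(records, key, index):
    """Per-bucket change in the summary attributable to one record."""
    before = summarize(records, key)
    after = summarize(records[:index] + records[index + 1:], key)
    return {
        bucket: before[bucket] - after.get(bucket, 0.0)
        for bucket in before
        if before[bucket] != after.get(bucket, 0.0)
    }


# Hypothetical input data, summarized along two dimensions.
records = [
    {"desk": "rates", "region": "EU", "amount": 100.0},
    {"desk": "rates", "region": "US", "amount": -40.0},
    {"desk": "fx", "region": "EU", "amount": 25.0},
]
by_desk = summarize(records, lambda r: r["desk"])
by_region = summarize(records, lambda r: r["region"])
```

Under these assumptions, both summaries account for the same grand total, and `record_impact(records, lambda r: r["desk"], 1)` attributes a change only to the `rates` bucket, which is the kind of per-record assertion the abstract refers to.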