In 1977, John Tukey described how, in exploratory data analysis, analysts use tools such as data visualizations to separate their expectations from what they observe. An underappreciated aspect of data analysis, in contrast to statistical theory, is that the analyst must make decisions by comparing the observed data, or the output of a statistical tool, to what the analyst previously expected from the data. However, there is little formal guidance on how to make these analytic decisions, because statistical theory generally omits any discussion of who is using the statistical methods. Here, we extend the basic idea of comparing an analyst's expectations to what is observed in a data visualization to more general analytic situations. We propose a model for the iterative process of data analysis based on the analyst's expectations, using what we refer to as expected and anomaly probabilistic outcome sets, together with the concept of statistical information gain. Our model posits that the analyst's goal is to increase, through successive analytic iterations, the amount of information the analyst has relative to what the analyst already knows. We introduce two criteria, expected information gain and anomaly information gain, to provide guidance about analytic decision-making and ultimately to improve the practice of data analysis. Finally, we show how our framework can be used to characterize common situations in practical data analysis.