In 1977, John Tukey described how, in exploratory data analysis, analysts use tools such as data visualizations to separate their expectations from what they observe. An underappreciated aspect of data analysis, one that distinguishes it from statistical theory, is that the analyst must make decisions by comparing the observed data, or the output of a statistical tool, to what the analyst previously expected from the data. However, there is little formal guidance on how to make these data analytic decisions, because statistical theory generally omits any discussion of who is using the statistical methods. In this paper, we propose a model for the iterative process of data analysis based on the analyst's expectations, using what we refer to as expected and anomaly probabilistic outcome sets together with the concept of statistical information gain. The model extends the basic idea of comparing an analyst's expectations to what is observed in a data visualization to more general analytic situations. It posits that the analyst's goal is to increase, through successive analytic iterations, the amount of information the analyst has relative to what the analyst already knows. We introduce two criteria, expected information gain and anomaly information gain, to provide guidance about analytic decision-making and ultimately to improve the practice of data analysis. Finally, we show how our framework can be used to characterize common situations in practical data analysis.