A very common task in data visualization is to plot many data points, each a measured y-value at a fixed x-value. Uncertainties on the y-values are typically shown as vertical error bars that represent either a frequentist confidence interval or a Bayesian credible interval for each data point. Most of the time, these error bars correspond to a 68% confidence or credibility level, which leads to the intuition that a model fits the data reasonably well if its prediction lies within the error bars of roughly two thirds of the data points. Unfortunately, this and other intuitions break down when the uncertainties of the data points are correlated. If the error bars show only the square roots of the diagonal elements of a covariance matrix with non-negligible off-diagonal elements, the plot simply does not contain enough information to judge whether a drawn model line agrees with the data. In this paper we demonstrate this problem and discuss ways to add information to the plots that makes it easier to judge the agreement between the data and a model prediction, and to gain some insight into where the model might be deficient. This is done by explicitly showing the contribution of the first principal component of the uncertainties and by displaying the conditional uncertainties of all data points.
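The two quantities named in the last sentence of the abstract can be illustrated with a short NumPy sketch. It is not taken from the paper: it assumes Gaussian uncertainties described by a covariance matrix, takes the "first principal component" to be the eigenvector with the largest eigenvalue, and takes the "conditional uncertainty" of a point to be its standard deviation when all other points are held fixed, which for a multivariate normal is the inverse square root of the corresponding diagonal element of the inverse covariance. The function name `summarize_covariance` and the example matrix are hypothetical.

```python
import numpy as np


def summarize_covariance(cov):
    """Return (total, pc1, conditional) uncertainties per data point.

    Illustrative sketch only; the name and return layout are assumptions.
    """
    cov = np.asarray(cov, dtype=float)

    # Usual error bars: square roots of the diagonal elements.
    total = np.sqrt(np.diag(cov))

    # eigh returns eigenvalues in ascending order for symmetric matrices,
    # so the last entry is the first (largest) principal component.
    eigvals, eigvecs = np.linalg.eigh(cov)
    largest_val, largest_vec = eigvals[-1], eigvecs[:, -1]

    # Contribution of the first principal component to each error bar.
    pc1 = np.sqrt(largest_val) * np.abs(largest_vec)

    # Standard deviation of each point conditional on all other points,
    # assuming a multivariate normal: 1 / sqrt((cov^-1)_ii).
    conditional = 1.0 / np.sqrt(np.diag(np.linalg.inv(cov)))

    return total, pc1, conditional


# Made-up example: three points with unit variance and correlation 0.9.
cov = 0.1 * np.eye(3) + 0.9 * np.ones((3, 3))
total, pc1, conditional = summarize_covariance(cov)
print("total:      ", total)
print("first PC:   ", pc1)
print("conditional:", conditional)
```

In this made-up example the first principal component accounts for almost the whole error bar, while the conditional uncertainties are much smaller than the total ones, which is exactly the situation in which the two-thirds intuition for plain error bars misleads.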