Histograms are among the most popular methods used in exploratory analysis to summarize univariate distributions. In particular, irregular histograms are good non-parametric density estimators that require very few parameters: the number of bins, together with their lengths and frequencies. Many approaches have been proposed in the literature to infer these parameters, either by assuming hypotheses about the underlying data distribution or by exploiting a model selection approach. In this paper, we focus on the G-Enum histogram method, which exploits the Minimum Description Length (MDL) principle to build histograms without any user parameter and achieves state-of-the-art performance with respect to accuracy, parsimony, and computation time. We investigate the limits of this method in the case of outliers or heavy-tailed distributions, and we suggest a two-level heuristic to deal with such cases. The first level exploits a logarithmic transformation of the data to split the data set into a list of data subsets, each with a controlled range of values. The second level builds a sub-histogram for each data subset and aggregates them to obtain a complete histogram. Extensive experiments show the benefits of the approach.
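To make the two-level idea concrete, the following Python sketch illustrates one possible reading of it; it is not the procedure evaluated in the paper. Several elements are assumptions introduced only for illustration: the signed transform log10(1 + |x|), the `band_width` parameter (the number of decades allowed per subset), the function names, and the square-root equal-width rule, which merely stands in for G-Enum as the per-subset histogram builder.

```python
import numpy as np


def signed_log(x):
    """Signed log transform (assumed form): keeps the sign, compresses the magnitude."""
    return np.sign(x) * np.log10(1.0 + np.abs(x))


def two_level_histogram(values, band_width=2.0):
    """Illustrative two-level histogram; returns (edges, counts, density).

    band_width is a hypothetical parameter: the width, in decades of the
    transformed scale, that a single data subset is allowed to span.
    """
    x = np.sort(np.asarray(values, dtype=float))
    if x.size == 0 or x[0] == x[-1]:
        raise ValueError("need at least two distinct values")

    # Level 1: assign each value to a band of the log-transformed scale,
    # so that every subset covers a bounded range of magnitudes.
    band = np.floor(signed_log(x) / band_width).astype(int)
    subsets = [x[band == b] for b in np.unique(band)]

    # Level 2: build bin edges for each subset over its own range.
    # A square-root equal-width rule is used here as a stand-in for G-Enum.
    edge_parts = []
    for sub in subsets:
        k = max(1, int(np.ceil(np.sqrt(sub.size))))
        edge_parts.append(np.linspace(sub[0], sub[-1], k + 1))

    # Aggregate: merge all edges; gaps between subsets become single bins.
    edges = np.unique(np.concatenate(edge_parts))
    # np.histogram bins are half-open except the last one, so the maximum
    # of a subset is counted in the gap bin that follows it.
    counts, _ = np.histogram(x, bins=edges)
    density = counts / (x.size * np.diff(edges))
    return edges, counts, density
```

Calling `two_level_histogram` on a sample made of a standard normal bulk plus a few very large values keeps narrow bins over the bulk and covers the outliers with a few wide, low-density bins, whereas a single equal-width grid over the full range would place almost all observations in one bin.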