Multivariate histograms are difficult to construct because of the curse of dimensionality. Motivated by $k$-d trees in computer science, we show how to construct an efficient data-adaptive partition of Euclidean space that possesses the following two properties: with high confidence, the distribution from which the data are generated is close to uniform on each rectangle of the partition; and, despite the data-dependent construction, we can give guaranteed finite-sample simultaneous confidence intervals for the probabilities (and hence for the average densities) of each rectangle in the partition. This partition automatically adapts to the sizes of the regions where the distribution is close to uniform. The methodology produces confidence intervals whose widths depend only on the probability content of the rectangles and not on the dimensionality of the space, thus avoiding the curse of dimensionality. Moreover, the widths essentially match the optimal widths in the univariate setting. The simultaneous validity of the confidence intervals allows this construction, which we call {\sl Beta-trees}, to be used for various data-analytic purposes. We illustrate this by using Beta-trees for visualizing data and for multivariate mode-hunting.
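To make the $k$-d tree motivation concrete, the following is a minimal sketch of a data-adaptive rectangular partition obtained by recursive median splits. It is not the Beta-tree construction itself: the stopping rule (`min_points`), the cycling choice of split coordinate, and the names `kd_partition` and `volume` are assumptions made for illustration, and the sketch reports only empirical average densities, without the simultaneous confidence intervals or the uniformity guarantee described in the abstract.

```python
import random

def kd_partition(points, lower, upper, min_points, depth=0):
    """Split the box [lower, upper] at the sample median, cycling through coordinates.

    Returns a list of leaves (lower, upper, count). The stopping rule is a
    hypothetical choice: stop once a box holds fewer than 2 * min_points points.
    """
    n = len(points)
    if n < 2 * min_points:
        return [(lower, upper, n)]
    axis = depth % len(lower)
    ordered = sorted(points, key=lambda p: p[axis])
    mid = n // 2
    cut = ordered[mid][axis]  # split location is an order statistic of the data
    left_upper = upper[:axis] + (cut,) + upper[axis + 1:]
    right_lower = lower[:axis] + (cut,) + lower[axis + 1:]
    return (kd_partition(ordered[:mid], lower, left_upper, min_points, depth + 1)
            + kd_partition(ordered[mid:], right_lower, upper, min_points, depth + 1))

def volume(lower, upper):
    """Lebesgue volume of the axis-aligned box [lower, upper]."""
    v = 1.0
    for lo, hi in zip(lower, upper):
        v *= hi - lo
    return v

# Illustrative data: 200 uniform points in the unit cube (an assumption, not from the paper).
random.seed(0)
dim, n = 3, 200
points = [tuple(random.random() for _ in range(dim)) for _ in range(n)]
leaves = kd_partition(points, (0.0,) * dim, (1.0,) * dim, min_points=10)

# Empirical average density of a rectangle: (fraction of points) / volume.
for lower, upper, count in leaves[:3]:
    print(count, round(count / (n * volume(lower, upper)), 3))
```

Because every cut is placed at a data point, each rectangle holds a prescribed number of observations while its volume adapts to the data, so dense regions receive small rectangles and sparse regions receive large ones.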