Many statistical problems require estimating a density function, say $f$, from data samples. In this work, for example, we are interested in highest-density regions (HDRs), i.e., minimum-volume sets that contain a given probability mass. HDRs are typically computed using a density quantile approach, which, when the density is unknown, requires estimating it. This task is far from trivial, especially in higher dimensions and when data are sparse and exhibit complex structures (e.g., multimodality or particular dependencies). We address this challenge by exploring alternative approaches to building HDRs that avoid direct (multivariate) density estimation. First, we generalize the density quantile method, which currently relies on a consistent estimator of the density, to neighbourhood measures, i.e., measures that preserve the order induced in the sample by $f$. Second, we discuss a number of suitable probabilistic- and distance-based measures, such as the $k$-nearest-neighbour Euclidean distance. Third, motivated by the ubiquitous role of copula modeling in modern statistics, we explore its use in the context of probabilistic-based measures. An extensive comparison among the introduced measures is provided, and their implications for computing HDRs in real-world problems are discussed.
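To make the density quantile idea concrete, the following is a minimal sketch, not the authors' implementation, of how a distance-based measure could stand in for a density estimate. It assumes the measure is the Euclidean distance to the $k$-th nearest neighbour, so that a small distance plays the role of a high density and the ordering is reversed: a point is flagged as inside the HDR when its distance does not exceed the empirical quantile at the target coverage. The function names `knn_distance` and `hdr_membership`, the choice `k=10`, and the synthetic bimodal sample are all illustrative assumptions.

```python
import numpy as np


def knn_distance(points, k):
    """Euclidean distance from each point to its k-th nearest neighbour (self excluded)."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the zero self-distance,
    # so column k holds the distance to the k-th nearest neighbour.
    return np.sort(dists, axis=1)[:, k]


def hdr_membership(points, coverage=0.9, k=10):
    """Flag the sample points that fall in the estimated HDR of the given coverage.

    Assumption: a small k-NN distance stands in for a high density, so the
    density quantile rule f(x) >= f_alpha becomes distance(x) <= threshold.
    """
    d = knn_distance(points, k)
    threshold = np.quantile(d, coverage)
    return d <= threshold, threshold


# Synthetic bimodal sample (illustrative data only).
rng = np.random.default_rng(0)
sample = np.vstack([
    rng.normal(-2.0, 0.5, size=(200, 2)),
    rng.normal(2.0, 0.5, size=(200, 2)),
])

inside, threshold = hdr_membership(sample, coverage=0.9, k=10)
print(f"share of points flagged as inside the 90% HDR: {inside.mean():.3f}")
```

The pairwise distance matrix keeps the sketch dependency-free but needs memory quadratic in the sample size; for larger samples a tree-based neighbour search would be the more natural choice.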