On the Use of Bagging for Local Intrinsic Dimensionality Estimation

The theory of Local Intrinsic Dimensionality (LID) has become a valuable tool for characterizing local complexity within and across data manifolds, supporting a range of data mining and machine learning tasks. Accurate LID estimation requires samples drawn from small neighborhoods around each query to avoid biases from nonlocal effects and potential manifold mixing, yet the limited data within such neighborhoods tends to cause high estimation variance. As a variance reduction strategy, we propose an ensemble approach that uses subbagging to preserve the local distribution of nearest neighbor (NN) distances. The main challenge is that uniformly reducing the total sample size in each subsample increases the proximity threshold needed to find a fixed number k of NNs around the query. As a result, in the specific context of LID estimation, the sampling rate interacts with the neighborhood size in an additional, complex way: the two together determine the sample size as well as the locality and resolution considered for estimation. We analyze both theoretically and experimentally how the choice of the sampling rate and the k-NN size used for LID estimation, alongside the ensemble size, affects performance, so that these hyperparameters can be chosen in advance according to application-based preferences. Our results indicate that, within broad and well-characterized regions of the hyperparameter space, a bagged estimator most often significantly reduces both the variance and the mean squared error relative to the corresponding non-bagged baseline, with a controllable impact on bias. We additionally propose and evaluate different ways of combining bagging with neighborhood smoothing for substantial further improvements in LID estimation performance.
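
The subbagging idea described above can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not state: the base estimator is taken to be the standard maximum-likelihood (Hill-type) LID estimator, the per-subsample estimates are combined by a plain arithmetic mean, and the names `lid_mle`, `knn_dists`, and `bagged_lid` are hypothetical helpers introduced only for this illustration. The toy data set (a 5-dimensional uniform cube embedded in 20 dimensions) is also an assumption.

```python
import numpy as np

def lid_mle(dists):
    """Hill-type MLE of LID from sorted, positive k-NN distances (assumed base estimator)."""
    dists = np.asarray(dists, dtype=float)
    w = dists[-1]  # neighborhood radius: distance to the k-th nearest neighbor
    return -1.0 / np.mean(np.log(dists / w))

def knn_dists(data, query, k):
    """Sorted distances from `query` to its k nearest neighbors in `data`."""
    d = np.linalg.norm(data - query, axis=1)
    d = d[d > 0]  # drop the query itself and exact duplicates
    return np.sort(np.partition(d, k - 1)[:k])

def bagged_lid(data, query, k, rate, n_bags, rng):
    """Average the LID estimate over `n_bags` subsamples drawn without replacement."""
    n = len(data)
    m = max(k + 1, int(round(rate * n)))  # subsample size; always leaves at least k neighbors
    estimates = []
    for _ in range(n_bags):
        idx = rng.choice(n, size=m, replace=False)  # subbagging: no replacement
        # With sampling rate `rate`, the k-th NN in the subsample sits roughly
        # where the (k / rate)-th NN of the full data set would be, so the
        # effective neighborhood radius grows as the rate shrinks.
        estimates.append(lid_mle(knn_dists(data[idx], query, k)))
    return float(np.mean(estimates))  # mean aggregation is an assumption

rng = np.random.default_rng(0)
# Toy data (assumed): 5-D uniform cube embedded in 20-D, so the true LID is 5.
data = np.zeros((5000, 20))
data[:, :5] = rng.uniform(-1.0, 1.0, size=(5000, 5))
query = np.zeros(20)  # cube center, away from boundary effects

single = lid_mle(knn_dists(data, query, k=20))
bagged = bagged_lid(data, query, k=20, rate=0.5, n_bags=10, rng=rng)
print(f"non-bagged: {single:.2f}, bagged: {bagged:.2f}")
```

In this sketch the three hyperparameters discussed in the abstract appear as `k` (the k-NN size), `rate` (the sampling rate), and `n_bags` (the ensemble size). Lowering `rate` while holding `k` fixed widens the neighborhood actually used, which is the interplay between sampling rate and locality that the abstract highlights.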
