Inhomogeneities in real-world data, e.g., due to changes in the observation noise level or variations in the structural complexity of the source function, pose a unique set of challenges for statistical inference. Accounting for them can greatly improve predictive power when physical resources or computation time is limited. In this paper, we draw on recent theoretical results on the estimation of local function complexity (LFC), derived from the domain of local polynomial smoothing (LPS), to establish a notion of local structural complexity, which we use to develop a model-agnostic active learning (AL) framework. Because it relies on pointwise estimates, the LPS model class is neither robust nor scalable with respect to the high input-space dimensions typical of real-world problems. We therefore derive and estimate the Gaussian process regression (GPR)-based analog of the LPS-based LFC and use it as a substitute in the above framework to make it robust and scalable. We assess the effectiveness of our LFC estimate in an AL application on a prototypical low-dimensional synthetic dataset, before taking on the challenging real-world task of reconstructing a quantum chemical force field for a small organic molecule, where we demonstrate state-of-the-art performance with a significantly reduced training demand.