Inhomogeneities in real-world data, e.g., due to changes in the observation noise level or variations in the structural complexity of the source function, pose a unique set of challenges for statistical inference. Accounting for them can greatly improve predictive power when physical resources or computation time are limited. In this paper, we draw on recent theoretical results on the estimation of local function complexity (LFC), derived from the domain of local polynomial smoothing (LPS), to establish a notion of local structural complexity, which we use to develop a model-agnostic active learning (AL) framework. Because it relies on pointwise estimates, the LPS model class is neither robust nor scalable with respect to the high-dimensional input spaces typical of real-world problems. We therefore derive and estimate the Gaussian process regression (GPR)-based analog of the LPS-based LFC and use it as a substitute in the above framework to make it robust and scalable. We assess the effectiveness of our LFC estimate in an AL application on a prototypical low-dimensional synthetic dataset, before taking on the challenging real-world task of reconstructing a quantum chemical force field for a small organic molecule, where we demonstrate state-of-the-art performance with a significantly reduced training demand.