Data analysis and processing are vital to modern society and require vast computational resources. To reduce this computational burden, compressing and approximating data has become a central topic. We consider the approximation of labeled data samples, described mathematically as site-to-value maps between finite metric spaces. Within this setting, we identify the discrete modulus of continuity as an effective data-intrinsic quantity for measuring the regularity of site-to-value maps without imposing further structural assumptions. We investigate the consistency of the discrete modulus of continuity in the infinite data limit and propose an algorithm for its efficient computation. Building on these results, we present a sample-based approximation theory for labeled data. For data subject to statistical uncertainty, we consider multilevel approximation spaces and a variant of the multilevel Monte Carlo method to compute statistical quantities of interest. Our considerations connect approximation theory for labeled data in metric spaces to the covering problem for (random) balls on the one hand, and the efficient evaluation of the discrete modulus of continuity to combinatorial optimization on the other. We provide extensive numerical studies to illustrate the feasibility of the approach and to validate our theoretical results.
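To make the central quantity concrete, the following is a minimal sketch of a discrete modulus of continuity for a site-to-value map. It assumes the usual form of the definition: for a radius t, take the largest distance between two values whose sites lie within distance t of each other. The abstract does not state the definition, so the exact form used in the paper may differ. The function name `discrete_modulus`, its parameters, and the sample data are hypothetical. The brute-force loop over all pairs is for illustration only and is not the efficient algorithm the abstract refers to.

```python
from itertools import combinations


def discrete_modulus(sites, values, d_x, d_y, t):
    """Largest value distance over all pairs of sites at distance <= t.

    Assumed definition (not taken from the paper):
        omega(t) = max { d_y(f(x), f(x')) : d_x(x, x') <= t }
    Brute force over all pairs, so the cost is quadratic in the sample size.
    """
    worst = 0.0
    for i, j in combinations(range(len(sites)), 2):
        if d_x(sites[i], sites[j]) <= t:
            worst = max(worst, d_y(values[i], values[j]))
    return worst


def abs_dist(a, b):
    """Absolute-value metric on the real line, used for sites and values."""
    return abs(a - b)


# Hypothetical labeled sample: sites on the real line, values f(x) = x**2.
sites = [0.0, 0.5, 1.0, 2.0]
values = [x * x for x in sites]

for t in (0.5, 1.0, 2.0):
    print(t, discrete_modulus(sites, values, abs_dist, abs_dist, t))
```

Because the metrics are passed in as functions, the same sketch applies to any pair of finite metric spaces, not only to real-valued data. The result is non-decreasing in t, since enlarging the radius can only admit more pairs.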