Solving large-scale Optimal Transport (OT) problems in machine learning typically relies on sampling the measures to obtain a tractable discrete problem. While the discrete solver's accuracy is controllable, the convergence rate of the discretization error is governed by the intrinsic dimension of the data; the true bottleneck is therefore knowing and controlling the sampling error. In this work, we tackle this issue by introducing novel estimators for both the sampling error and the intrinsic dimension. The key finding is a simple, tuning-free estimator of $\text{OT}_c(\rho, \hat\rho)$ that relies on the semi-dual OT functional and, remarkably, requires no OT solver. From the multi-scale decay of this sampling-error estimator we further derive a fast intrinsic dimension estimator. This framework yields significant computational and statistical advantages in practice, enabling us to (i) quantify the convergence rate of the discretization error, (ii) calibrate the entropic regularization of Sinkhorn divergences to the data's intrinsic geometry, and (iii) introduce a novel intrinsic-dimension-based Richardson extrapolation estimator that strongly debiases Wasserstein distance estimation. Numerical experiments demonstrate that our geometry-aware pipeline effectively mitigates the discretization-error bottleneck while maintaining computational efficiency.
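To make the ingredients (ii) and (iii) concrete, here is a minimal, hedged sketch of the two generic steps the abstract alludes to: estimating an intrinsic dimension from the power-law decay of a multi-scale error curve, and using that estimate in a one-step Richardson extrapolation to debias a plug-in estimate. The function names, the exact bias model $W + C\,n^{-1/d}$, and the synthetic check are illustrative assumptions, not the paper's actual pipeline:

```python
import numpy as np

def fit_intrinsic_dim(ns, errors):
    """Estimate intrinsic dimension d from a multi-scale decay
    errors(n) ~ C * n**(-1/d) via a log-log least-squares fit.
    (Illustrative helper; not the paper's estimator.)"""
    slope, _ = np.polyfit(np.log(ns), np.log(errors), 1)
    return -1.0 / slope

def richardson_extrapolate(w_n, w_2n, d_hat):
    """One-step Richardson extrapolation: debias a plug-in estimate
    whose bias decays like C * n**(-1/d_hat), given estimates at
    sample sizes n and 2n."""
    r = 2.0 ** (-1.0 / d_hat)  # factor by which the bias shrinks from n to 2n
    return (w_2n - r * w_n) / (1.0 - r)

# Synthetic sanity check with an exact power-law bias, d = 4:
W_true, C, d = 1.0, 0.5, 4
ns = np.array([250, 500, 1000, 2000])
errs = C * ns ** (-1.0 / d)
d_hat = fit_intrinsic_dim(ns, errs)  # recovers ~4.0 on exact data
w_n, w_2n = W_true + errs[-2], W_true + errs[-1]
w_ext = richardson_extrapolate(w_n, w_2n, d_hat)  # recovers ~W_true
print(round(d_hat, 6), round(w_ext, 6))
```

On exact power-law data the log-log fit returns the true exponent and the extrapolation cancels the leading bias term; on real sampled estimates both steps would carry statistical noise, which is precisely why a cheap, solver-free error estimator matters.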