The Wasserstein distance is a discrepancy measure between probability distributions, defined by an optimal transport problem. It has been used for various tasks, such as retrieving similar items from high-dimensional image or text data. In retrieval applications, however, the Wasserstein distance must be computed repeatedly, and its cubic time complexity with respect to input size makes it unsuitable for large-scale datasets. Recently, tree-based approximation methods have been proposed to address this bottleneck. For example, the Flowtree algorithm computes the transport plan on a quadtree and evaluates its cost using the ground metric, and clustering-tree approaches have been reported to achieve high accuracy. However, these existing trees often require significant construction time during preprocessing, and, crucially, standard quadtrees cannot grow deep enough in high-dimensional spaces, resulting in poor approximation accuracy. In this paper, we propose kd-Flowtree, a Wasserstein distance approximation method that embeds the data using a kd-tree. Since kd-trees can grow sufficiently deep and adaptively even in high dimensions, kd-Flowtree maintains good approximation accuracy in such cases. In addition, kd-trees can be constructed more quickly than quadtrees, which reduces the total computation time of nearest-neighbor search, including preprocessing. We provide a probabilistic upper bound on the nearest-neighbor search accuracy of kd-Flowtree and show that this bound is independent of the dataset size. In numerical experiments on retrieval tasks with real-world data, kd-Flowtree outperformed existing Wasserstein distance approximation methods.
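The general idea of computing a transport plan on a tree and then evaluating it with the ground metric can be illustrated with a minimal sketch. This is not the authors' implementation: the median split on cycling axes, the one-point leaves, the Euclidean ground metric, and the names `build_kdtree`, `_match`, and `tree_flow_cost` are all assumptions made for illustration. Both distributions are assumed to be weight vectors `mu` and `nu` over a shared set of support points.

```python
import numpy as np

def build_kdtree(points, idx=None, depth=0):
    """Build a kd-tree as nested dicts; each leaf holds one point index.
    Assumption: split at the median along axes chosen cyclically."""
    if idx is None:
        idx = np.arange(len(points))
    if len(idx) <= 1:
        return {"leaf": True, "idx": idx}
    axis = depth % points.shape[1]
    order = idx[np.argsort(points[idx, axis], kind="stable")]
    mid = len(order) // 2
    return {
        "leaf": False,
        "left": build_kdtree(points, order[:mid], depth + 1),
        "right": build_kdtree(points, order[mid:], depth + 1),
    }

def _match(supply, demand, points, tol=1e-12):
    """Greedily pair (index, mass) entries of supply and demand.
    The cost of each pairing uses the Euclidean ground metric."""
    supply, demand = list(supply), list(demand)
    cost, i, j = 0.0, 0, 0
    while i < len(supply) and j < len(demand):
        (p, a), (q, b) = supply[i], demand[j]
        f = min(a, b)  # mass moved from point p to point q
        cost += f * float(np.linalg.norm(points[p] - points[q]))
        supply[i], demand[j] = (p, a - f), (q, b - f)
        if supply[i][1] <= tol:
            i += 1
        if demand[j][1] <= tol:
            j += 1
    return cost, supply[i:], demand[j:]

def tree_flow_cost(node, points, mu, nu):
    """Match mass bottom-up: as much as possible inside each subtree,
    passing only the unmatched remainder up to the parent."""
    if node["leaf"]:
        s = [(i, mu[i]) for i in node["idx"] if mu[i] > 0]
        d = [(i, nu[i]) for i in node["idx"] if nu[i] > 0]
        return _match(s, d, points)
    cl, sl, dl = tree_flow_cost(node["left"], points, mu, nu)
    cr, sr, dr = tree_flow_cost(node["right"], points, mu, nu)
    c, s, d = _match(sl + sr, dl + dr, points)
    return cl + cr + c, s, d

# Usage: move one unit of mass from point 0 to point 1.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
mu = np.array([1.0, 0.0, 0.0, 0.0])
nu = np.array([0.0, 1.0, 0.0, 0.0])
tree = build_kdtree(points)
cost, _, _ = tree_flow_cost(tree, points, mu, nu)
print(cost)
```

Because the resulting flow is a feasible transport plan, its cost under the ground metric is never below the exact Wasserstein distance. How close it gets depends on how well the tree separates the points, which is where tree depth matters.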