Distributed distribution comparison aims to measure the distance between distributions whose data are scattered across different agents in a distributed system and cannot be shared directly among those agents. In this study, we propose a novel decentralized entropic optimal transport (DEOT) method, which provides a communication-efficient and privacy-preserving solution to this problem with theoretical guarantees. In particular, we design a mini-batch randomized block-coordinate descent (MRBCD) scheme to optimize the DEOT distance in its dual form. The dual variables are scattered across different agents and are updated locally and iteratively, with limited communication among a subset of agents. The kernel matrix involved in the gradients of the dual variables is estimated by a decentralized kernel approximation method, in which each agent only needs to approximate and store a sub-kernel matrix using one-shot communication, without sharing raw data. Beyond computing the entropic Wasserstein distance, we show that the proposed MRBCD scheme and kernel approximation method also apply to the entropic Gromov-Wasserstein distance. We analyze our method's communication complexity and, under mild assumptions, provide a theoretical bound for the approximation error caused by the convergence error, the estimated kernel, and the mismatch between the storage and communication protocols. In addition, we discuss the trade-off between the precision of the entropic optimal transport (EOT) distance and the strength of privacy protection when implementing our method. Experiments on synthetic data and real-world distributed domain adaptation tasks demonstrate the effectiveness of our method.
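To make the idea of block-coordinate updates on the EOT dual concrete, the following is a minimal single-machine sketch. It is not the paper's decentralized MRBCD algorithm: it assumes the standard entropic dual with plan entries exp((f_i + g_j - C_ij) / eps), it replaces mini-batch gradient steps with exact maximization over a randomly chosen block of dual variables, and it involves no agents, communication, or kernel approximation. The function name `entropic_ot_rbcd` and all parameters are hypothetical and chosen for illustration only.

```python
import math
import random


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def entropic_ot_rbcd(a, b, C, eps=0.5, block_size=2, n_iters=3000, seed=0):
    """Randomized block-coordinate ascent on the entropic OT dual.

    a, b: marginal weights (positive, each summing to 1).
    C: cost matrix as a list of rows, shape len(a) x len(b).
    Returns the transport plan P and the dual variables f, g.
    """
    rng = random.Random(seed)
    n, m = len(a), len(b)
    f = [0.0] * n  # dual variables attached to the source marginal
    g = [0.0] * m  # dual variables attached to the target marginal
    for _ in range(n_iters):
        if rng.random() < 0.5:
            # Exact maximization over a random block of f: afterwards the
            # selected rows of the plan sum exactly to a[i].
            for i in rng.sample(range(n), min(block_size, n)):
                lse = logsumexp([(g[j] - C[i][j]) / eps for j in range(m)])
                f[i] = eps * (math.log(a[i]) - lse)
        else:
            # Same update for a random block of g (columns sum to b[j]).
            for j in rng.sample(range(m), min(block_size, m)):
                lse = logsumexp([(f[i] - C[i][j]) / eps for i in range(n)])
                g[j] = eps * (math.log(b[j]) - lse)
    P = [[math.exp((f[i] + g[j] - C[i][j]) / eps) for j in range(m)]
         for i in range(n)]
    return P, f, g


# Toy problem: 4 source points, 4 target points, cost |i - j| / 3.
a = [0.25, 0.25, 0.25, 0.25]
b = [0.1, 0.2, 0.3, 0.4]
C = [[abs(i - j) / 3 for j in range(4)] for i in range(4)]
P, f, g = entropic_ot_rbcd(a, b, C)
transport_cost = sum(P[i][j] * C[i][j] for i in range(4) for j in range(4))
```

Updating every source variable and then every target variable in alternation would recover the usual Sinkhorn iteration. Randomly chosen blocks are what make it plausible for each update to touch only the dual variables held by a few agents, which is the setting the abstract describes.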