Efficient computation of optimal transport distances between distributions is of growing importance in data science. Sinkhorn-based methods are currently the state of the art for such computations, but they require $O(n^2)$ computations. In addition, Sinkhorn-based methods commonly use a Euclidean ground distance between datapoints. However, with the prevalence of manifold-structured scientific data, it is often desirable to use a geodesic ground distance instead. Here, we tackle both issues by proposing Geodesic Sinkhorn, a method based on diffusing a heat kernel on a manifold graph. Notably, Geodesic Sinkhorn requires only $O(n\log n)$ computation, as we approximate the heat kernel with Chebyshev polynomials based on the sparse graph Laplacian. We apply our method to the computation of barycenters of several distributions of high-dimensional single-cell data from patient samples undergoing chemotherapy. In particular, we define the barycentric distance as the distance between two such barycenters. Using this definition, we identify an optimal transport distance and path associated with the effect of treatment on cellular data.
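The idea of replacing a dense Gibbs kernel with heat diffusion on a sparse graph can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`knn_laplacian`, `heat_apply`, `geodesic_sinkhorn`), the unweighted k-NN graph, the combinatorial Laplacian, the Gershgorin eigenvalue bound, and the default `order`, `n_iter`, and `floor` values are all assumptions chosen for illustration. The sketch applies $e^{-tL}$ to a vector through a truncated Chebyshev expansion, so the kernel matrix is never formed, and plugs that operator into standard Sinkhorn scaling iterations.

```python
import numpy as np
import scipy.sparse as sp
from scipy.spatial import cKDTree
from scipy.special import ive


def knn_laplacian(points, k=4):
    """Combinatorial Laplacian L = D - A of a symmetrized, unweighted k-NN graph."""
    n = points.shape[0]
    _, idx = cKDTree(points).query(points, k=k + 1)  # column 0 is the point itself
    rows = np.repeat(np.arange(n), k)
    cols = idx[:, 1:].ravel()
    adj = sp.csr_matrix((np.ones(n * k), (rows, cols)), shape=(n, n))
    adj = adj.maximum(adj.T)  # make the graph undirected
    degree = np.asarray(adj.sum(axis=1)).ravel()
    return (sp.diags(degree) - adj).tocsr()


def heat_apply(lap, x, t, order=100):
    """Approximate exp(-t * lap) @ x with a Chebyshev expansion (assumed order)."""
    lmax = 2.0 * lap.diagonal().max()  # Gershgorin upper bound on the spectrum
    alpha = 0.5 * t * lmax

    def scaled(v):
        # Applies (2 / lmax) * lap - I, whose eigenvalues lie in [-1, 1].
        return (2.0 / lmax) * (lap @ v) - v

    # exp(-alpha * (y + 1)) = sum_k c_k T_k(y); the c_k are scaled Bessel values.
    ks = np.arange(order + 1)
    coef = 2.0 * (-1.0) ** ks * ive(ks, alpha)
    coef[0] *= 0.5

    t_prev, t_curr = x, scaled(x)
    out = coef[0] * t_prev + coef[1] * t_curr
    for c in coef[2:]:
        # Three-term recurrence T_{k+1} = 2 y T_k - T_{k-1}.
        t_prev, t_curr = t_curr, 2.0 * scaled(t_curr) - t_prev
        out = out + c * t_curr
    return out


def geodesic_sinkhorn(lap, a, b, t, n_iter=200, floor=1e-300):
    """Sinkhorn scaling with the heat kernel exp(-t * lap) as the Gibbs kernel."""

    def kernel(v):
        # The floor guards against tiny negative values from truncation.
        return np.maximum(heat_apply(lap, v, t), floor)

    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / kernel(v)
        v = b / kernel(u)  # the heat kernel is symmetric, so K^T = K
    return u, v
```

Each kernel application costs `order` sparse matrix-vector products, so the work scales with the number of graph edges rather than with $n^2$. The implied transport plan is $\mathrm{diag}(u)\,e^{-tL}\,\mathrm{diag}(v)$; it only needs to be materialized if the full coupling is required.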