Finding meaningful ways to measure the statistical dependency between random variables $\xi$ and $\zeta$ is a timeless statistical endeavor. In recent years, several novel concepts, like the distance covariance, have extended classical notions of dependency to more general settings. In this article, we propose and study an alternative framework that is based on optimal transport. The transport dependency $\tau \ge 0$ applies to general Polish spaces and intrinsically respects metric properties. For suitable ground costs, independence is fully characterized by $\tau = 0$. Via proper normalization of $\tau$, three transport correlations $\rho_\alpha$, $\rho_\infty$, and $\rho_*$ with values in $[0, 1]$ are defined. They attain the value $1$ if and only if $\zeta = \varphi(\xi)$, where $\varphi$ is an $\alpha$-Lipschitz function for $\rho_\alpha$, a measurable function for $\rho_\infty$, or a multiple of an isometry for $\rho_*$. The transport dependency can be estimated consistently by an empirical plug-in approach, and alternative estimators with the same convergence rate but significantly reduced computational cost are also proposed. Numerical results suggest that $\tau$ robustly recovers dependency between data sets with different internal metric structures. The use of $\tau$ for inferential tasks, like transport-dependency-based independence testing, is illustrated on a data set from a cancer study.
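The empirical plug-in approach mentioned above can be illustrated with a small sketch. The abstract does not spell out the definition, so the code assumes that $\tau$ is the optimal transport cost between the joint distribution of $(\xi, \zeta)$ and the product of its marginals, with the additive ground cost $d_X(x, x') + d_Y(y, y')$ built from Euclidean distances. The function name `transport_dependency` is hypothetical, and the dense linear program is a naive illustration for small samples only, not one of the reduced-cost estimators proposed in the article.

```python
import numpy as np
from scipy.optimize import linprog


def transport_dependency(x, y):
    """Naive plug-in estimate of the transport dependency (illustrative only).

    Assumption: tau is the optimal transport cost between the empirical joint
    distribution of the pairs (x_i, y_i) and the product of the empirical
    marginals, under the additive ground cost d_X + d_Y (Euclidean distances).
    """
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = len(x)
    m = n * n  # number of atoms of the product measure

    # Pairwise distances within each sample, each of shape (n, n).
    dx = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
    dy = np.linalg.norm(y[:, None] - y[None, :], axis=-1)

    # cost[i, j, k]: moving the joint atom (x_i, y_i) to the product atom (x_j, y_k).
    cost = (dx[:, :, None] + dy[:, None, :]).reshape(n, m)

    # Transport plan P of shape (n, m), flattened row-major.
    # Each row sums to 1/n (joint atoms), each column to 1/n^2 (product atoms).
    row_sums = np.kron(np.eye(n), np.ones((1, m)))
    col_sums = np.kron(np.ones((1, n)), np.eye(m))
    A_eq = np.vstack([row_sums, col_sums])
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])

    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=8)
    print("dependent:  ", transport_dependency(x, x ** 2))
    print("independent:", transport_dependency(x, rng.normal(size=8)))
```

The linear program has $n^3$ variables, so this sketch is only practical for a handful of points. For a constant second sample the empirical joint distribution coincides with the product of its marginals, and the estimate is zero, in line with the characterization of independence by $\tau = 0$.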