This work is motivated by the study of local protein structure, which is described by two variable dihedral angles distributed according to probability distributions on the flat torus. Our goal is to equip the space $\mathcal{P}(\mathbb{R}^2/\mathbb{Z}^2)$ with a metric that quantifies local structural modifications due to changes in the protein sequence, and to define associated two-sample goodness-of-fit tests. Because it adapts to the geometry of the underlying space, we focus on the Wasserstein distance as a metric between distributions. We extend existing results from the theory of Optimal Transport to the $d$-dimensional flat torus $\mathbb{T}^d=\mathbb{R}^d/\mathbb{Z}^d$, in particular a Central Limit Theorem. Moreover, we propose several approaches to two-sample goodness-of-fit testing in the one- and two-dimensional cases, based on the Wasserstein distance, and we prove their validity and consistency. We provide an implementation of these tests in \textsf{R}. Their performance is assessed by numerical experiments on synthetic data and illustrated by an application to protein structure data.
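For reference, the two objects named above admit the following standard definitions. This is a sketch under assumptions: the order $p \geq 1$ and the Euclidean norm $\|\cdot\|$ are generic choices that the text above does not fix.

\[
d_{\mathbb{T}^d}(x,y) = \min_{k \in \mathbb{Z}^d} \|x - y + k\|,
\qquad
W_p(\mu,\nu) = \Bigl( \inf_{\pi \in \Pi(\mu,\nu)} \int_{\mathbb{T}^d \times \mathbb{T}^d} d_{\mathbb{T}^d}(x,y)^p \, \mathrm{d}\pi(x,y) \Bigr)^{1/p},
\]

where $\mu,\nu \in \mathcal{P}(\mathbb{T}^d)$ and $\Pi(\mu,\nu)$ denotes the set of couplings of $\mu$ and $\nu$, that is, probability measures on $\mathbb{T}^d \times \mathbb{T}^d$ with marginals $\mu$ and $\nu$. The minimum over integer shifts $k$ is what makes the cost respect the periodicity of the angles.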