An increasingly common viewpoint is that protein dynamics data sets reside in a non-linear subspace of low conformational energy. Ideal data analysis tools for such data sets should therefore account for this non-linear geometry. A Riemannian geometry setting is well suited to the task for two reasons. First, it comes with a rich structure that can account for a wide range of geometries modelled after an energy landscape. Second, many standard data analysis tools originally developed for data in Euclidean space can be generalised to data on a Riemannian manifold. In the context of protein dynamics, a conceptual challenge comes from the lack of a suitable smooth manifold and the lack of guidelines for constructing a smooth Riemannian structure based on an energy landscape. In addition, the computational cost of geodesics and related mappings poses a major challenge. This work addresses both challenges. The first part of the paper develops a novel local approximation technique for computing geodesics and related mappings on Riemannian manifolds in a computationally feasible manner. The second part constructs a smooth manifold of point clouds modulo rigid body group actions, together with a Riemannian structure that is based on an energy landscape for protein conformations. The resulting Riemannian geometry is tested on several data analysis tasks relevant for protein dynamics data. It performs exceptionally well on coarse-grained molecular dynamics simulated data. In particular, geodesics with given start and end points approximately recover the corresponding molecular dynamics trajectories for proteins that undergo relatively ordered transitions with medium-sized deformations. The Riemannian protein geometry also gives physically realistic summary statistics and retrieves the underlying dimension, even for large deformations, within seconds on a laptop.
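To make the phrase "point clouds modulo rigid body group actions" concrete, the following minimal sketch checks numerically that two conformations related by a rotation and a translation share the same pairwise distance matrix, so any description built from those distances treats them as one point. This is an illustration only and not the construction used in the paper: the function names, the 10-bead random conformation, and the choice of pairwise distances as the invariant are all assumptions made for the example.

```python
import numpy as np

def pairwise_distances(points):
    """Return the (n, n) matrix of Euclidean distances between rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def random_rigid_motion(rng):
    """Draw a random proper rotation (det = +1) and a random translation in 3D."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))  # fix column signs of the orthogonal factor
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]       # flip one axis to turn a reflection into a rotation
    return q, rng.standard_normal(3)

def apply_rigid_motion(points, rotation, translation):
    """Rotate every point, then shift the whole cloud."""
    return points @ rotation.T + translation

rng = np.random.default_rng(0)
# Hypothetical coarse-grained conformation: 10 beads in 3D (illustrative data only).
conformation = rng.standard_normal((10, 3))
rotation, translation = random_rigid_motion(rng)
moved = apply_rigid_motion(conformation, rotation, translation)

# The coordinates differ, but the pairwise distances agree up to rounding.
same_class = np.allclose(pairwise_distances(conformation), pairwise_distances(moved))
```

Note that pairwise distances are also unchanged by reflections, so this particular invariant would identify mirror-image conformations as well; whether that is acceptable depends on the application.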