Large Scale Training of Graph Neural Networks for Optimal Markov-Chain Partitioning Using the Kemeny Constant

Traditional clustering algorithms often struggle to capture the complex relationships within graphs and to generalise to arbitrary clustering criteria. The emergence of graph neural networks (GNNs) as a powerful framework for learning representations of graph data provides new approaches to this problem. Previous work has shown that GNNs can propose partitionings under a variety of criteria; however, these approaches have not yet been extended to Markov chains or kinetic networks, which arise frequently in the study of molecular systems and are of particular interest to the biochemical modelling community. In this work, we propose several GNN-based architectures to tackle the graph partitioning problem for Markov chains described as kinetic networks. The approach aims to minimise the change in the Kemeny constant induced by a proposed partitioning. We propose an encoder-decoder architecture and show that simple GraphSAGE-based GNNs with linear layers can outperform much larger and more expressive attention-based models in this context. As a proof of concept, we first demonstrate the method's ability to cluster randomly connected graphs, and we also apply it to a linear chain corresponding to a 1D free energy profile, treated as a kinetic network. We then demonstrate the effectiveness of the method through experiments on a data set derived from molecular dynamics and compare its performance with that of other partitioning techniques such as PCCA+. Finally, we explore the importance of feature and hyperparameter selection and propose a general strategy for large-scale parallel training of GNNs for discovering optimal graph partitionings.
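
To make the partitioning objective concrete, the following is a minimal sketch of how the change in the Kemeny constant could be evaluated for a candidate partition. It rests on several assumptions that are not taken from this work: a discrete-time transition matrix `P` rather than a rate matrix, the convention K = sum over i >= 2 of 1/(1 - lambda_i) (some authors add 1), and a stationary-flux-weighted lumping of states into clusters. The function names `stationary_distribution`, `kemeny_constant`, and `lump`, and the four-state example chain, are hypothetical and purely illustrative.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for the eigenvalue closest to 1, normalised to sum to 1."""
    vals, vecs = np.linalg.eig(P.T)
    idx = np.argmin(np.abs(vals - 1.0))
    pi = np.real(vecs[:, idx])
    return pi / pi.sum()

def kemeny_constant(P):
    """Assumed convention: sum of 1 / (1 - lambda_i) over all eigenvalues except lambda = 1."""
    vals = np.linalg.eigvals(P)
    idx = np.argmin(np.abs(vals - 1.0))
    rest = np.delete(vals, idx)
    return float(np.real(np.sum(1.0 / (1.0 - rest))))

def lump(P, labels):
    """Assumed coarse-graining: aggregate stationary fluxes pi_i * P_ij by cluster,
    then row-normalise to obtain a transition matrix between clusters."""
    pi = stationary_distribution(P)
    n, k = P.shape[0], int(labels.max()) + 1
    A = np.zeros((n, k))
    A[np.arange(n), labels] = 1.0          # one-hot cluster assignment matrix
    flux = A.T @ (pi[:, None] * P) @ A     # cluster-to-cluster stationary flux
    return flux / flux.sum(axis=1, keepdims=True)

# Illustrative linear chain with two wells, {0, 1} and {2, 3}, joined by a slow step.
P = np.array([
    [0.50, 0.50, 0.00, 0.00],
    [0.49, 0.50, 0.01, 0.00],
    [0.00, 0.01, 0.50, 0.49],
    [0.00, 0.00, 0.50, 0.50],
])

well_partition = np.array([0, 0, 1, 1])    # respects the slow barrier
mixed_partition = np.array([0, 1, 1, 0])   # cuts across both wells

k_full = kemeny_constant(P)
delta_well = abs(k_full - kemeny_constant(lump(P, well_partition)))
delta_mixed = abs(k_full - kemeny_constant(lump(P, mixed_partition)))
print(delta_well, delta_mixed)
```

Under these assumptions, the partition that follows the slow barrier changes the Kemeny constant far less than the one that mixes the two wells, which is the kind of signal a learned partitioner could be trained against.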
