Large Scale Training of Graph Neural Networks for Optimal Markov-Chain Partitioning Using the Kemeny Constant

Traditional clustering algorithms often struggle to capture the complex relationships within graphs and to generalise to arbitrary clustering criteria. The emergence of graph neural networks (GNNs) as a powerful framework for learning representations of graph data provides new approaches to this problem. Previous work has shown that GNNs can propose partitionings under a variety of criteria; however, these approaches have not yet been extended to Markov chains or kinetic networks, which arise frequently in the study of molecular systems and are of particular interest to the biochemical modelling community. In this work, we propose several GNN-based architectures to tackle the graph partitioning problem for Markov chains described as kinetic networks. The approach aims to minimise the change in the Kemeny constant induced by a proposed partitioning. We use an encoder-decoder architecture and show that simple GraphSAGE-based GNNs with linear layers can outperform much larger and more expressive attention-based models in this context. As a proof of concept, we first demonstrate the method's ability to cluster randomly connected graphs. We also apply it to a kinetic network with a linear chain architecture corresponding to a 1D free energy profile. Subsequently, we demonstrate the effectiveness of the method through experiments on a data set derived from molecular dynamics and compare its performance to other partitioning techniques such as PCCA+. Finally, we explore the importance of feature and hyperparameter selection and propose a general strategy for large-scale parallel training of GNNs for discovering optimal graph partitionings.
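To make the objective concrete, the following is a minimal sketch of how the change in the Kemeny constant under a partitioning could be evaluated. It is not the implementation used in this work, and it rests on several assumptions: a discrete-time, row-stochastic transition matrix; the convention K = sum over i >= 2 of 1/(1 - lambda_i), which differs from other conventions by an additive constant; and stationary-flux aggregation as the coarse-graining rule. The function names and the four-state toy chain are hypothetical.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for the eigenvalue closest to 1, normalised to sum to 1."""
    w, v = np.linalg.eig(P.T)
    idx = np.argmin(np.abs(w - 1.0))
    pi = np.real(v[:, idx])
    return pi / pi.sum()

def kemeny_constant(P):
    """Kemeny constant under the assumed convention: sum of 1/(1 - lambda_i), i >= 2."""
    w = np.linalg.eigvals(P)
    idx = np.argmin(np.abs(w - 1.0))   # drop the stationary eigenvalue
    w = np.delete(w, idx)
    return float(np.real(np.sum(1.0 / (1.0 - w))))

def lump(P, labels):
    """Coarse-grain P by aggregating stationary probability flux between clusters.

    This aggregation rule is an assumption made for illustration.
    """
    labels = np.asarray(labels)
    pi = stationary_distribution(P)
    n_clusters = labels.max() + 1
    M = np.eye(n_clusters)[labels]           # one-hot membership, shape (n, k)
    flux = M.T @ (pi[:, None] * P) @ M       # flux between clusters
    return flux / flux.sum(axis=1, keepdims=True)

def kemeny_change(P, labels):
    """Absolute difference between the Kemeny constants of the full and lumped chains."""
    return abs(kemeny_constant(P) - kemeny_constant(lump(P, labels)))

# Hypothetical toy chain: two wells {0, 1} and {2, 3} joined by a slow transition.
P = np.array([
    [0.50, 0.50, 0.00, 0.00],
    [0.49, 0.50, 0.01, 0.00],
    [0.00, 0.01, 0.50, 0.49],
    [0.00, 0.00, 0.50, 0.50],
])

good = [0, 0, 1, 1]   # respects the two wells
bad = [0, 1, 0, 1]    # mixes states across the barrier

print("change for well-aligned partition:", kemeny_change(P, good))
print("change for mixed partition:", kemeny_change(P, bad))
```

In this toy example the partition that respects the two wells preserves the slow relaxation mode and therefore changes the Kemeny constant far less than the partition that mixes states across the barrier. For a continuous-time kinetic network described by a rate matrix, the analogous quantity would be built from the non-zero eigenvalues of the generator instead.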
