The traditional Triangular Maximally Filtered Graph (TMFG) construction requires pre-computing and storing a dense correlation matrix, which limits its applicability to small and medium-sized datasets. Here we identify the key memory and runtime complexity challenges of using TMFG at scale. We then present the Approximate Triangular Maximally Filtered Graph (a-TMFG) algorithm, a novel TMFG-inspired approach to scaling the construction of artificial graphs from data. The method employs k-Nearest Neighbors Graphs (kNNG) for the initial construction and implements a memory management strategy to search for and estimate missing correlations on the fly. Together, these provide representations that keep the combinatorial explosion under control. The algorithm is tested for robustness to its parameters and to noise, and is evaluated on datasets with millions of observations. This new method provides a parsimonious way to construct graphs for use cases in which graphs serve as input to supervised and unsupervised learning but no natural graph exists.
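The two ingredients named above, a kNNG built without a dense correlation matrix and on-the-fly estimation of correlations missing from it, can be illustrated with a minimal sketch. This is not the a-TMFG algorithm itself: the function names (`standardize`, `knn_correlations`, `make_lookup`) are hypothetical, an exact chunked search stands in for whatever nearest-neighbor search the authors use, and a bounded LRU cache is only an assumed stand-in for their memory management strategy.

```python
from functools import lru_cache

import numpy as np


def standardize(X):
    """Center and scale each row (one variable per row) to unit norm,
    so that a dot product of two rows equals their Pearson correlation."""
    Z = X - X.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # assumption: constant rows get correlation 0
    return Z / norms


def knn_correlations(Z, k, chunk=256):
    """Return {i: {j: corr}} holding the k most correlated neighbors of each row.

    Rows are processed in chunks, so only a (chunk x n) block of
    correlations is in memory at once, never the full (n x n) matrix.
    Requires k < n.
    """
    n = Z.shape[0]
    sparse = {}
    for start in range(0, n, chunk):
        block = Z[start:start + chunk] @ Z.T
        for r, row in enumerate(block):
            i = start + r
            row[i] = -np.inf  # exclude self-correlation
            top = np.argpartition(row, -k)[-k:]  # indices of the k largest
            sparse[i] = {int(j): float(row[j]) for j in top}
    return sparse


def make_lookup(Z, sparse, cache_size=10_000):
    """Build corr(i, j): read from the kNN graph when the pair is stored,
    otherwise compute it on demand and keep it in a bounded LRU cache."""

    @lru_cache(maxsize=cache_size)
    def computed(a, b):
        return float(Z[a] @ Z[b])

    def corr(i, j):
        if j in sparse[i]:
            return sparse[i][j]
        if i in sparse[j]:
            return sparse[j][i]
        # Order the pair so (i, j) and (j, i) share one cache entry.
        return computed(min(i, j), max(i, j))

    return corr


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 200))  # 1000 variables, 200 observations each
    Z = standardize(X)
    sparse = knn_correlations(Z, k=10)
    corr = make_lookup(Z, sparse)
    print(corr(0, 1))
```

Under these assumptions, storage grows with n·k for the graph plus the cache size, rather than with n² for a dense matrix; the cost is that a pair outside the graph is recomputed whenever it has been evicted from the cache.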