We introduce aweSOM, an open-source Python package for machine learning (ML) clustering and classification. It implements a self-organizing map (SOM) algorithm with CPU/GPU acceleration to accommodate large ($N > 10^6$, where $N$ is the number of data points), multidimensional datasets. aweSOM consists of two main modules: one handles the initialization and training of the SOM, and the other stacks the results of multiple SOM realizations to obtain more statistically robust clusters. Existing Python-based SOM implementations (e.g., POPSOM, Yuan (2018); MiniSom, Vettigli (2018); sklearn-som) serve primarily as proof-of-concept demonstrations: they are optimized for small datasets and do not scale to large, multidimensional data. aweSOM fills this gap, with performance that scales well up to $\sim 10^8$ individual points and support for multiple features per point. We compare the code's performance against the legacy implementations it is based on and find a 10-100x speedup, as well as significantly improved memory efficiency, owing to several built-in optimizations.
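For readers unfamiliar with the method, the core SOM training loop can be sketched in a few lines. The following is a minimal, standard-library-only illustration of the classic online SOM update (a best-matching-unit search followed by a Gaussian neighborhood update). It is not aweSOM's implementation or API: the function names `train_som` and `best_matching_unit`, the linear decay schedules, the lattice size, and the toy two-cluster data are all assumptions chosen for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the index of the node whose weight vector is closest to x."""
    return min(
        range(len(weights)),
        key=lambda k: sum((w - v) ** 2 for w, v in zip(weights[k], x)),
    )


def train_som(data, rows, cols, n_steps=2000, lr0=0.5, seed=0):
    """Train a small rectangular SOM with the online update rule.

    Hypothetical helper for illustration only; the decay schedules below
    are one common choice, not the ones used by any particular package.
    """
    rng = random.Random(seed)
    sigma0 = max(rows, cols) / 2.0
    # Lattice coordinates of each node, and one weight vector per node
    # initialized from randomly chosen data points.
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [list(rng.choice(data)) for _ in coords]

    for step in range(n_steps):
        frac = step / n_steps
        lr = lr0 * (1.0 - frac)                  # learning rate decays to zero
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighborhood width shrinks
        x = rng.choice(data)
        br, bc = coords[best_matching_unit(weights, x)]
        for (r, c), w in zip(coords, weights):
            # Gaussian neighborhood on the lattice, centered on the BMU.
            h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2.0 * sigma ** 2))
            for j, xj in enumerate(x):
                w[j] += lr * h * (xj - w[j])
    return weights


# Toy data (assumed): two well-separated 2D Gaussian blobs.
rng = random.Random(42)
blob_a = [(rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)) for _ in range(100)]
blob_b = [(rng.gauss(10.0, 0.5), rng.gauss(10.0, 0.5)) for _ in range(100)]
data = blob_a + blob_b

weights = train_som(data, rows=2, cols=2)
node_a = best_matching_unit(weights, (0.0, 0.0))
node_b = best_matching_unit(weights, (10.0, 10.0))
print("BMU for blob A center:", node_a)
print("BMU for blob B center:", node_b)
```

A per-sample Python loop of this kind costs on the order of (number of nodes) x (number of features) operations per step and is far too slow for datasets with $N > 10^6$ points, which is the regime where vectorized or GPU-accelerated implementations become necessary.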