Many scientific problems involve data embedded in a space with periodic boundary conditions. Such periodicity can arise, for instance, from an inherent cyclic or rotational symmetry in the data or from a spatially extended periodicity. Analyzing such data efficiently requires methods tailored to respect the periodic boundary conditions of the problem. In this work, we present a clustering method for data embedded in a periodic domain. It is based on DBSCAN, a widely used unsupervised machine learning algorithm that identifies clusters in data. The proposed method internally leverages the conventional DBSCAN algorithm for domains with open boundaries, so it remains compatible with all optimized implementations of neighborhood searches in open domains. As a result, it retains their optimized runtime complexity of $O(N\log N)$. We demonstrate the proposed method on synthetic data in one, two, and three dimensions, and we apply it to a real-world example involving the clustering of bubbles in a turbulent flow. The proposed approach is implemented in a ready-to-use, publicly available Python package.
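The idea of reusing an open-boundary DBSCAN inside a periodic domain can be illustrated with a minimal sketch. It is not taken from the authors' package; it assumes one common construction, in which points within `eps` of a face are copied as periodic images just outside the box, the open-boundary algorithm is run on the padded set, and cluster labels are merged wherever an image and its original (core) point disagree. All names here (`open_dbscan`, `periodic_dbscan`, `cluster_fn`) are hypothetical, and the brute-force `open_dbscan` is only an $O(N^2)$ stand-in for an optimized implementation. The sketch also assumes `eps` is smaller than half of every box length.

```python
from itertools import product
from math import dist


def open_dbscan(points, eps, min_samples):
    """Brute-force open-boundary DBSCAN (illustrative stand-in, O(N^2)).

    Returns (labels, is_core); noise is labelled -1. A point counts itself
    among its neighbours.
    """
    n = len(points)
    nbrs = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
            for i in range(n)]
    is_core = [len(nb) >= min_samples for nb in nbrs]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        labels[i] = cluster
        stack = [i]  # only core points are ever expanded
        while stack:
            p = stack.pop()
            for q in nbrs[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if is_core[q]:
                        stack.append(q)
        cluster += 1
    return labels, is_core


def periodic_dbscan(points, box, eps, min_samples, cluster_fn=open_dbscan):
    """Cluster points in the periodic box [0, box[k]) along each axis k.

    Assumption of this sketch: eps < box[k] / 2 for every axis.
    """
    n = len(points)
    aug, parent_of = list(points), list(range(n))
    for i, p in enumerate(points):
        # Per axis, list the shifts that place an image within eps of the box.
        options = []
        for x, length in zip(p, box):
            opts = [0]
            if x <= eps:
                opts.append(1)    # image just beyond the upper face
            if x >= length - eps:
                opts.append(-1)   # image just below the lower face
            options.append(opts)
        for shift in product(*options):
            if any(shift):
                aug.append(tuple(x + s * l for x, s, l in zip(p, shift, box)))
                parent_of.append(i)

    labels, is_core = cluster_fn(aug, eps, min_samples)

    root = {}  # union-find over the cluster labels of the padded run

    def find(a):
        root.setdefault(a, a)
        while root[a] != a:
            root[a] = root[root[a]]
            a = root[a]
        return a

    out = list(labels[:n])
    for j in range(n, len(aug)):
        i = parent_of[j]
        if labels[j] == -1:
            continue
        if is_core[i]:
            # A core point and its image must belong to the same cluster.
            root[find(labels[j])] = find(labels[i])
        elif out[i] == -1:
            # A border point reached only across the boundary adopts that label.
            out[i] = labels[j]

    remap = {}  # relabel merged clusters as 0, 1, 2, ...
    return [-1 if lab == -1 else remap.setdefault(find(lab), len(remap))
            for lab in out]


pts = [(0.2,), (0.5,), (9.6,), (9.9,), (5.0,)]
print(periodic_dbscan(pts, box=(10.0,), eps=0.7, min_samples=3))
# → [0, 0, 0, 0, -1]
```

In this one-dimensional example the four points near 0 and 10 form a single cluster across the boundary, whereas the open-boundary run alone labels all five points as noise. Because merging is restricted to core points, border points cannot bridge two otherwise separate clusters. The `cluster_fn` argument is where an optimized open-domain implementation could be substituted, for example a thin wrapper around scikit-learn's `DBSCAN` that returns its `labels_` together with a core mask built from `core_sample_indices_`.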