Synthetic data is essential for assessing clustering techniques: it complements and extends real data and allows more complete coverage of a given problem's space. Synthetic data generators, in turn, can create vast amounts of data, a crucial capability when real-world data is at a premium, while providing a well-understood generation procedure and an interpretable instrument for methodically investigating cluster analysis algorithms. Here, we present Clugen, a modular procedure for synthetic data generation, capable of creating multidimensional clusters supported by line segments using arbitrary distributions. Clugen is open source, comprehensively unit tested and documented, and is available for the Python, R, Julia, and MATLAB/Octave ecosystems. We demonstrate that our proposal can produce rich and varied results across different numbers of dimensions, is fit for use in the assessment of clustering algorithms, and has the potential to be a widely used framework in diverse clustering-related research tasks.