We present repliclust (from repli-cate and clust-er), a Python package for generating synthetic data sets with clusters. Our approach is based on data set archetypes, high-level geometric descriptions from which the user can create many different data sets, each possessing the desired geometric characteristics. The architecture of our software is modular and object-oriented, decomposing data generation into algorithms for placing cluster centers, sampling cluster shapes, selecting the number of data points for each cluster, and assigning probability distributions to clusters. The project webpage, repliclust.org, provides a concise user guide and thorough documentation.
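The four-stage decomposition described above can be illustrated with a minimal sketch. It uses only NumPy and does not reproduce repliclust's actual API or algorithms. The function name `generate_dataset`, its parameters, and the specific choices made at each stage (rejection sampling for centers, random ellipsoids for shapes, Dirichlet-weighted cluster sizes, a normal or Student-t distribution per cluster) are assumptions made for illustration.

```python
import numpy as np


def generate_dataset(n_clusters=4, dim=2, n_samples=400,
                     min_separation=6.0, max_aspect=3.0, seed=0):
    """Hypothetical four-stage generator; not repliclust's implementation."""
    rng = np.random.default_rng(seed)

    # Stage 1: place cluster centers by rejection sampling in a box.
    # The box is large enough that a valid spot always remains.
    side = 2.0 * min_separation * n_clusters
    centers = []
    while len(centers) < n_clusters:
        candidate = rng.uniform(0.0, side, size=dim)
        if all(np.linalg.norm(candidate - c) >= min_separation
               for c in centers):
            centers.append(candidate)

    # Stage 2: sample cluster shapes as randomly oriented ellipsoids
    # whose longest-to-shortest axis ratio is at most max_aspect.
    covariances = []
    for _ in range(n_clusters):
        axes, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
        lengths = rng.uniform(0.5, 0.5 * max_aspect, size=dim)
        covariances.append(axes @ np.diag(lengths ** 2) @ axes.T)

    # Stage 3: choose the number of points per cluster (at least one
    # each), with imbalance driven by Dirichlet weights.
    weights = rng.dirichlet(5.0 * np.ones(n_clusters))
    sizes = 1 + rng.multinomial(n_samples - n_clusters, weights)

    # Stage 4: assign a probability distribution to each cluster.
    families = rng.choice(["normal", "student_t"], size=n_clusters)

    # Sample each cluster from its center, shape, size, and family.
    blocks = []
    for center, cov, size, family in zip(centers, covariances,
                                         sizes, families):
        z = rng.standard_normal((size, dim))
        if family == "student_t":
            df = 5  # heavier tails than the normal distribution
            z /= np.sqrt(rng.chisquare(df, size=(size, 1)) / df)
        blocks.append(center + z @ np.linalg.cholesky(cov).T)

    X = np.vstack(blocks)
    y = np.repeat(np.arange(n_clusters), sizes)
    return X, y
```

In this sketch, calling `generate_dataset` repeatedly with different `seed` values while holding the other arguments fixed yields different data sets that share the same high-level geometry, which loosely mirrors the role the abstract assigns to a data set archetype.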