Simulated high-dimensional data is useful for testing, validating, and improving algorithms for dimension reduction and for supervised and unsupervised learning. High-dimensional data is characterized by multiple variables that are dependent or associated in some way, for example through linear or nonlinear relationships, clustering, or anomalies. Here we provide new methods for generating a variety of high-dimensional structures using mathematical functions and statistical distributions, organized into the R package cardinalR. Several example data sets are also provided. These will help researchers better understand how different analytical methods work and how they can be improved, with a special focus on nonlinear dimension reduction methods. The package enriches the existing toolset of benchmark data sets for evaluating algorithms.