The lack of freely available high- or ultra-high-dimensional, multi-class datasets, whether real or synthetic, may hamper the rapidly growing research on feature screening, especially in biometrics, where such datasets are common. This paper presents BiometricBlender, a Python package that generates ultra-high-dimensional, multi-class synthetic data for benchmarking a wide range of feature screening methods. During data generation, the user can control the overall usefulness and the intercorrelations of the blended features, so that the synthetic feature space can imitate the key properties of a real biometric dataset.
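To illustrate the general idea of blending, the following is a minimal sketch that uses only the Python standard library. It is not BiometricBlender's actual API or algorithm: the function name `blend_synthetic_features` and all of its parameters are hypothetical, and the linear mixing of hidden features is an assumption chosen for simplicity. In this toy version, `separation` stands in for overall usefulness, and the number of hidden features relative to `n_blended` governs how strongly the blended features are intercorrelated.

```python
import random


def blend_synthetic_features(n_classes=5, samples_per_class=4, n_useful=3,
                             n_noise=3, n_blended=50, separation=2.0, seed=0):
    """Toy generator (hypothetical): mix a few hidden features into many blended ones."""
    rng = random.Random(seed)
    n_hidden = n_useful + n_noise
    # Each class gets its own centre, but only on the useful hidden features.
    centres = [[rng.gauss(0.0, separation) for _ in range(n_useful)]
               for _ in range(n_classes)]
    # Random mixing weights: every blended feature combines all hidden features,
    # so blended features sharing the same hidden sources become correlated.
    weights = [[rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
               for _ in range(n_blended)]
    X, y = [], []
    for label in range(n_classes):
        for _ in range(samples_per_class):
            useful = [c + rng.gauss(0.0, 1.0) for c in centres[label]]
            noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
            hidden = useful + noise
            X.append([sum(w * h for w, h in zip(row, hidden)) for row in weights])
            y.append(label)
    return X, y


X, y = blend_synthetic_features()
print(len(X), len(X[0]), len(set(y)))  # 20 50 5
```

Under these assumptions, a larger `separation` makes the classes easier to tell apart, while a larger share of noise among the hidden features dilutes the class signal carried by each blended feature.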