The research community continues to seek increasingly advanced synthetic data generators for reliably evaluating the strengths and limitations of machine learning methods. This work aims to increase the availability of datasets spanning a diverse range of problem complexities by proposing a genetic algorithm that optimizes a set of problem complexity measures toward specific target values for classification and regression tasks. For classification, a set of 10 complexity measures was used; for regression, 4 measures that demonstrated promising optimization capabilities were selected. Experiments confirmed that the proposed genetic algorithm can generate datasets of varying difficulty by applying linear feature projections to synthetically created datasets so that they reach target complexity values. Evaluations involving state-of-the-art classifiers and regressors revealed a correlation between the complexity of the generated data and the recognition quality.
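The idea of steering a complexity measure toward a target through a linear feature projection can be illustrated with a minimal sketch. It is not the authors' implementation: the single measure used here (a Fisher-ratio-based F1 score for binary labels) stands in for the full sets of 10 and 4 measures, and the function names `f1_complexity` and `evolve_projection`, the elitist selection, the uniform crossover, the Gaussian mutation, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np


def f1_complexity(X, y):
    """Fisher-ratio-based complexity in (0, 1]; higher means harder.

    Assumes binary labels 0/1. Illustrative stand-in for a full measure set.
    """
    a, b = X[y == 0], X[y == 1]
    num = (a.mean(axis=0) - b.mean(axis=0)) ** 2
    den = a.var(axis=0) + b.var(axis=0) + 1e-12  # guard against zero variance
    return 1.0 / (1.0 + np.max(num / den))


def evolve_projection(X, y, target, pop_size=30, generations=40,
                      sigma=0.1, seed=0):
    """Evolve a d x d matrix W so that f1_complexity(X @ W) approaches target."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pop = rng.normal(size=(pop_size, d, d))  # random initial projections

    def error(W):
        return abs(f1_complexity(X @ W, y) - target)

    history = []
    for _ in range(generations):
        errs = np.array([error(W) for W in pop])
        order = np.argsort(errs)
        history.append(float(errs[order[0]]))

        # Elitism (assumed operator): the better half survives unchanged.
        elite = pop[order[: pop_size // 2]]

        # Uniform crossover between random elite parents, then Gaussian mutation.
        n_children = pop_size - len(elite)
        pa = elite[rng.integers(len(elite), size=n_children)]
        pb = elite[rng.integers(len(elite), size=n_children)]
        mask = rng.random(pa.shape) < 0.5
        children = np.where(mask, pa, pb) + rng.normal(scale=sigma, size=pa.shape)
        pop = np.concatenate([elite, children])

    errs = np.array([error(W) for W in pop])
    history.append(float(errs.min()))
    return pop[np.argmin(errs)], history


# Synthetic two-class data: class 1 is shifted by 2 in every feature.
data_rng = np.random.default_rng(1)
y = np.repeat([0, 1], 100)
X = data_rng.normal(size=(200, 3)) + y[:, None] * 2.0

W, history = evolve_projection(X, y, target=0.8)
X_new = X @ W  # transformed dataset with complexity pushed toward the target
print("initial error:", history[0], "final error:", history[-1])
print("achieved complexity:", f1_complexity(X_new, y))
```

Because the elite individuals are carried over unchanged, the best error in `history` never increases from one generation to the next. Extending the sketch to several measures would require replacing `error` with a distance between a vector of measures and a vector of targets, which is again an assumption about how such an objective could be formed.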