This paper addresses the problem of predicting the performance of machine learning classification models as a function of the number of training examples per class, rather than only the overall number of training examples. This leads to a combinatorial question: given a fixed overall training dataset size, which combinations of per-class training example counts should be considered? To answer this question, an algorithm is proposed that is motivated by special cases of space-filling design of experiments. The resulting data are modeled with power-law curves and similar models, extended in the manner of generalized linear models, i.e., by replacing the overall training dataset size with a parametrized linear combination of the numbers of training examples per label class. The proposed algorithm has been applied to the CIFAR10 and EMNIST datasets.
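The two ingredients described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify the space-filling procedure, so the sketch simply enumerates every per-class count combination on a coarse grid, and a space-filling design would select a subset of such points. The function names, the grid step, and the functional form a * (sum_k w_k * n_k)^(-b) + c are assumptions chosen for illustration.

```python
from itertools import product


def class_count_designs(total, num_classes, step):
    """Enumerate per-class training counts that sum to `total`.

    Hypothetical helper: counts are restricted to positive multiples of
    `step`. This is a full grid enumeration, not a space-filling design.
    """
    if total % step != 0:
        raise ValueError("total must be a multiple of step")
    units = total // step
    designs = []
    # Choose counts for the first num_classes - 1 classes; the last class
    # receives whatever remains, provided it is still at least one unit.
    for combo in product(range(1, units + 1), repeat=num_classes - 1):
        last = units - sum(combo)
        if last >= 1:
            designs.append(tuple(step * u for u in combo + (last,)))
    return designs


def extended_power_law(counts, weights, a, b, c):
    """Assumed model form: a * (sum_k w_k * n_k) ** (-b) + c.

    The overall dataset size of a plain power law is replaced by a
    weighted linear combination of the per-class counts.
    """
    effective_size = sum(w * n for w, n in zip(weights, counts))
    return a * effective_size ** (-b) + c


# Example: 3 classes, 600 training examples in total, grid step of 100.
designs = class_count_designs(total=600, num_classes=3, step=100)

# With all weights equal to 1 the effective size is the overall size,
# so the extended model reduces to the ordinary power law.
balanced = extended_power_law((200, 200, 200), (1.0, 1.0, 1.0), a=2.0, b=0.5, c=0.1)
```

With unequal weights, two designs of the same overall size yield different predictions, which is the effect the per-class model is meant to capture. In practice the parameters a, b, c and the weights would be fitted to measured performance values.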