Predicting the training time of machine learning (ML) models has become highly relevant in the scientific community. Being able to predict the training time of an ML model a priori would enable automatic selection of the best model in terms of both energy efficiency and performance, for instance in MLOps architectures. In this paper, we present our ongoing work in this direction. In particular, we present an extensive empirical study of the Full Parameter Time Complexity (FPTC) approach by Zheng et al., which is, to the best of our knowledge, the only approach that formalizes the training time of ML models as a function of both dataset and model parameters. We study the formulations proposed for the Logistic Regression and Random Forest classifiers, and we highlight the main strengths and weaknesses of the approach. Finally, our study shows that training time prediction is strictly tied to the context (i.e., the dataset involved) and that the FPTC approach is not generalizable.