Learning curves measure how the performance of a machine learning model improves with the volume of training data. Across a wide variety of applications and models, learning curves have been observed to follow, to a large extent, power-law behavior. This makes the performance of different models on a given task somewhat predictable and gives practitioners who are exploring the space of possible models and hyperparameters for the problem at hand an opportunity to reduce training time. By estimating the learning curve of each model from training on small subsets of the data, only the best models need to be considered for training on the full dataset. How to choose the subset sizes, and how often to sample models on them to obtain estimates, has however not been studied. Given that the goal is to reduce overall training time, strategies are needed that sample the performance in a time-efficient way and yet lead to accurate learning curve estimates. In this paper we formulate a framework for such strategies and propose several of them. Further, we evaluate the strategies on simulated learning curves and in experiments with popular datasets and models for image classification tasks.
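To make the idea of selecting models from small-subset measurements concrete, the following is a minimal sketch and not the strategies proposed in the paper. It assumes a pure power law of the form error(n) = a * n^(-b) with no offset term, noise-free synthetic measurements, and hypothetical names (`fit_power_law`, `extrapolate`, `rank_models`, `model_a`, `model_b`) chosen only for illustration.

```python
import numpy as np


def fit_power_law(sizes, errors):
    """Fit error = a * n**(-b) by least squares in log-log space.

    Assumption: no irreducible-error offset, so log(error) is linear in log(n).
    """
    log_n = np.log(np.asarray(sizes, dtype=float))
    log_err = np.log(np.asarray(errors, dtype=float))
    slope, intercept = np.polyfit(log_n, log_err, 1)
    return float(np.exp(intercept)), float(-slope)


def extrapolate(a, b, n):
    """Predicted error at training set size n under the fitted power law."""
    return a * n ** (-b)


def rank_models(measurements, full_size):
    """Rank models by predicted error on the full dataset (best first).

    measurements maps a model name to a (sizes, errors) pair.
    """
    predicted = {}
    for name, (sizes, errors) in measurements.items():
        a, b = fit_power_law(sizes, errors)
        predicted[name] = extrapolate(a, b, full_size)
    return sorted(predicted, key=predicted.get), predicted


# Synthetic, noise-free learning curves (illustrative values only).
subset_sizes = np.array([125, 250, 500, 1000])
measurements = {
    "model_a": (subset_sizes, 1.0 * subset_sizes ** -0.2),
    "model_b": (subset_sizes, 4.0 * subset_sizes ** -0.4),
}

ranking, predicted = rank_models(measurements, full_size=50_000)
a_fit, b_fit = fit_power_law(*measurements["model_b"])
```

In this synthetic setup `model_a` has the lower error at every sampled subset size, yet the extrapolated curves cross and `model_b` is predicted to be better on the full dataset. With noisy real measurements the fit would be less reliable, which is where the choice of subset sizes and the number of samples per size becomes important.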