In recent years, deep learning has gained popularity for its ability to solve complex classification tasks, delivering steadily better results thanks to the development of more accurate models, the availability of huge volumes of data, and the improved computational capabilities of modern computers. However, these improvements in performance also bring efficiency problems related to the storage of datasets and models and to the waste of energy and time involved in both training and inference. In this context, data reduction can help lower energy consumption when training a deep learning model. In this paper, we present up to eight different methods for reducing the size of a tabular training dataset, and we develop a Python package to apply them. We also introduce a topology-based representativeness metric to measure how similar the reduced datasets are to the full training dataset. Additionally, we develop a methodology for applying these data reduction methods to image datasets for object detection tasks. Finally, we experimentally compare how these data reduction methods affect the representativeness of the reduced dataset, the energy consumption, and the predictive performance of the model.