Machine learning models have achieved remarkable success in real-world applications such as data science, computer vision, and natural language processing. However, training these models requires large data sets and many iterations before the model performs well. Parallelizing training algorithms is a common strategy for speeding up training. Many studies of model training and inference, however, focus only on performance. Power consumption is also an important metric for any type of computation, especially for high-performance applications. Machine learning algorithms for low-power platforms such as sensors and mobile devices have been studied, but less power optimization has been done for algorithms designed for high-performance computing. In this paper, we present a C++ implementation of logistic regression and a genetic algorithm, and a Python implementation of neural networks trained with the stochastic gradient descent (SGD) algorithm on classification tasks. We show how model complexity and training data size affect the parallel efficiency of these algorithms in terms of both power and performance. We also test these implementations using shared-memory parallelism, distributed-memory parallelism, and GPU acceleration to speed up model training.