Training on edge devices poses several challenges because these devices are generally resource-constrained, especially in terms of power. State-of-the-art techniques at the device level reduce the GPU frequency to enforce power constraints, leading to a significant increase in training time. To accelerate training, we propose to jointly adjust the system and application parameters (in our case, the GPU frequency and the batch size of the training task) while adhering to the power constraints on devices. We introduce a novel cross-layer methodology that combines predictions of batch-size efficiency with device profiling to achieve the desired optimization. Our evaluation on real hardware shows that our method outperforms current baselines that rely on state-of-the-art techniques, reducing training time by $2.4\times$ with results very close to optimal. Our measurements also indicate a substantial reduction in the overall energy used for training. These gains are achieved without any loss in the accuracy of the trained model.