Automated co-design of machine learning models and the hardware that evaluates them is critical for deploying such models efficiently at scale. Despite their state-of-the-art performance, transformer models are not yet ready for execution on resource-constrained hardware platforms. The high memory requirements and low parallelizability of the transformer architecture exacerbate this problem. Recently proposed accelerators attempt to optimize the throughput and energy consumption of transformer models. However, such works are limited either to a one-sided search of the model architecture or to a restricted set of off-the-shelf devices. Furthermore, previous works accelerate only model inference and not training; training requires substantially more memory and compute resources, making the problem even more challenging. To address these limitations, this work proposes a dynamic training framework, called DynaProp, that speeds up the training process and reduces memory consumption. DynaProp is a low-overhead pruning method that prunes activations and gradients at runtime. To execute this method effectively on hardware for a diverse set of transformer architectures, we propose ELECTOR, a framework that simulates transformer inference and training on a design space of accelerators. We use this simulator in conjunction with the proposed co-design technique, called TransCODE, to obtain the best-performing models, which achieve high accuracy on the given task while minimizing latency, energy consumption, and chip area. The obtained transformer-accelerator pair achieves 0.3% higher accuracy than the state-of-the-art pair while incurring 5.2$\times$ lower latency and 3.0$\times$ lower energy consumption.
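To make the idea of pruning activations and gradients at runtime more concrete, the following is a minimal sketch, not DynaProp itself. The abstract does not state the pruning criterion, so the magnitude-threshold rule, the threshold values, and the helper names `prune_by_magnitude` and `matvec_skip_zeros` are all assumptions made for illustration. The sketch shows how zeroing small entries creates sparsity that a sparsity-aware accelerator could exploit by skipping multiply-accumulate operations.

```python
from typing import List, Tuple


def prune_by_magnitude(values: List[float], tau: float) -> Tuple[List[float], float]:
    """Zero every entry with |v| < tau (assumed criterion, not from the paper).

    Returns the pruned list and the fraction of entries that are zero.
    """
    pruned = [v if abs(v) >= tau else 0.0 for v in values]
    sparsity = sum(1 for v in pruned if v == 0.0) / len(pruned) if pruned else 0.0
    return pruned, sparsity


def matvec_skip_zeros(w: List[List[float]], x: List[float]) -> Tuple[List[float], int]:
    """Compute y = W x while skipping zero entries of x.

    Returns y and the number of multiply-accumulates actually performed.
    """
    nonzero = [(j, xj) for j, xj in enumerate(x) if xj != 0.0]
    y = [sum(row[j] * xj for j, xj in nonzero) for row in w]
    macs = len(nonzero) * len(w)
    return y, macs


if __name__ == "__main__":
    # Hypothetical values standing in for one layer's activations and gradients.
    activations = [0.9, -0.02, 0.004, -1.3]
    gradients = [0.001, -0.4, 0.03, 0.0002]
    weights = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]

    # Forward pass: prune activations before they are used (and stored).
    pruned_act, act_sparsity = prune_by_magnitude(activations, tau=0.05)
    y, macs = matvec_skip_zeros(weights, pruned_act)

    # Backward pass: prune incoming gradients with a separate threshold.
    pruned_grad, grad_sparsity = prune_by_magnitude(gradients, tau=0.01)

    dense_macs = len(weights) * len(activations)
    print("pruned activations:", pruned_act, "sparsity:", act_sparsity)
    print("pruned gradients:  ", pruned_grad, "sparsity:", grad_sparsity)
    print("MACs performed:", macs, "of", dense_macs)
```

In this toy case half of the activation entries fall below the assumed threshold, so the matrix-vector product needs 4 of the 8 dense multiply-accumulates. Any effect of such pruning on accuracy would have to be measured on the actual task; the sketch makes no claim about it.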