Executing machine learning inference on resource-constrained edge devices requires careful hardware-software co-design. Recent work has shown that transformer-based deep neural networks such as ALBERT can run natural language processing (NLP) inference on mobile systems-on-chip that house custom hardware accelerators. However, while these solutions effectively reduce the latency, energy, and area costs of running a single NLP task, multi-task inference requires computing over multiple variants of the model parameters, each tailored to one of the targeted tasks. This leads either to prohibitive on-chip memory requirements or to the cost of off-chip memory accesses. This paper proposes adapter-ALBERT, an efficient model optimization for maximal data reuse across different tasks. We evaluate the proposed model's performance and its robustness to data compression methods on several language tasks from the GLUE benchmark. Additionally, we demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by simulating it on a validated NLP edge accelerator and extrapolating the performance, power, and area improvements over executing a traditional ALBERT model on the same hardware platform.