Transformer neural networks achieve state-of-the-art accuracy across language and vision tasks, but their deployment on embedded hardware is hindered by stringent area, latency, and energy constraints. During inference, performance and efficiency are increasingly dominated by the Key--Value (KV) cache, whose memory footprint grows with sequence length and places pressure on on-chip memory. Although existing mechanisms such as Grouped-Query Attention (GQA) reduce KV cache requirements compared to Multi-Head Attention (MHA), exploiting this reduction effectively requires understanding how on-chip memory demand evolves over time. This work presents TRAPTI, a two-stage methodology that combines cycle-level inference simulation with time-resolved analysis of on-chip memory occupancy to guide design decisions. In the first stage, the framework obtains memory occupancy traces and memory access statistics from simulation. In the second stage, the framework uses these traces to explore banked memory organizations and power-gating configurations in an offline optimization flow. We apply this methodology to GPT-2 XL and DeepSeek-R1-Distill-Qwen-1.5B under the same accelerator configuration, enabling a direct comparison of MHA and GQA memory profiles. The analysis shows that, in this setting, DeepSeek-R1-Distill-Qwen-1.5B exhibits a 2.72x reduction in peak on-chip memory utilization compared to GPT-2 XL, unlocking further opportunities for power-gating optimization.
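As a rough illustration of why the KV cache footprint grows with sequence length and why GQA reduces it, the following sketch computes the cache size from the number of layers, the number of key/value heads, and the head dimension. It is a back-of-the-envelope estimate and not part of TRAPTI. The layer counts, head counts, and head dimensions are assumptions chosen to resemble an MHA model and a GQA model of the kind compared here, and the 2-byte element width is likewise an assumption. The resulting ratio covers KV cache size only, so it is not comparable to the 2.72x figure, which refers to peak on-chip memory utilization in the simulated accelerator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionConfig:
    """Hypothetical container for the parameters that determine KV cache size."""
    num_layers: int
    num_kv_heads: int           # equals the number of query heads for MHA
    head_dim: int
    bytes_per_element: int = 2  # assumption: 16-bit storage


def kv_cache_bytes(cfg: AttentionConfig, seq_len: int) -> int:
    """Bytes needed to hold keys and values for seq_len tokens (batch size 1)."""
    per_token = 2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim  # K and V
    return per_token * cfg.bytes_per_element * seq_len


# Assumed configurations, for illustration only.
mha = AttentionConfig(num_layers=48, num_kv_heads=25, head_dim=64)
gqa = AttentionConfig(num_layers=28, num_kv_heads=2, head_dim=128)

if __name__ == "__main__":
    for seq_len in (128, 512, 1024):
        mha_kib = kv_cache_bytes(mha, seq_len) / 1024
        gqa_kib = kv_cache_bytes(gqa, seq_len) / 1024
        print(f"seq_len={seq_len:5d}  MHA={mha_kib:10.1f} KiB  GQA={gqa_kib:10.1f} KiB")
```

Because the footprint is linear in `seq_len`, the ratio between the two configurations stays constant as the sequence grows. A static estimate of this kind says nothing about when the memory is actually occupied, which is the gap that time-resolved occupancy traces are meant to fill.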