This work introduces novel training and post-training compression schemes that reduce external memory access during transformer model inference. In addition, a new control-flow mechanism, called dynamic batching, and a novel buffer architecture, termed a two-direction accessible register file, further reduce external memory access while improving hardware utilization.