Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference

Softmax can become a computational bottleneck in the Multi-Head Attention (MHA) block of Transformer models, particularly in small models under low-precision inference, where exponentiation and normalization incur significant overhead. We propose Head-Calibrated Clipped-Linear Softmax (HCCS), a bounded, monotone surrogate for the exponential softmax function that applies a clipped linear mapping to the max-centered attention logits. The approximation yields a stable probability distribution with non-negative values and preserves the ordering of the original logits. Unlike previous softmax surrogates, HCCS includes a set of lightweight calibration parameters that are optimized offline on a representative dataset and calibrated separately for each attention head, so that the statistical properties of each head are preserved. We describe a hardware-motivated implementation of HCCS for high-throughput scenarios targeting the AMD Versal AI Engines. AMD's current reference implementations for this platform rely on either bfloat16 arithmetic or lookup tables (LUTs) to compute the exponential, which may limit throughput and leaves the AI Engine's high-throughput integer vector processing units underused. In contrast, HCCS maps naturally onto the AI Engines' int8 multiply-accumulate (MAC) units. To the best of our knowledge, this is the first int8-optimized softmax surrogate for AMD AI Engines. It runs significantly faster than the reference implementations while maintaining competitive task accuracy on small or heavily quantized MHA workloads after quantization-aware retraining.
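As an illustration of the idea only, the following minimal Python sketch applies a clipped linear map to max-centered logits and normalizes the result. The exact functional form, the parameter names `slope` and `offset`, and the example values are assumptions made for this sketch. They are not the calibrated parameters or the AI Engine kernel described in this work.

```python
def clipped_linear_softmax(logits, slope, offset):
    """Clipped-linear surrogate for softmax over one attention row.

    `slope` and `offset` are hypothetical per-head calibration
    parameters (both assumed > 0); the text above does not give
    the exact parameterization.
    """
    m = max(logits)
    # Max-centering: every centered value is <= 0 and the maximum maps to 0.
    centered = [x - m for x in logits]
    # Linear map, clipped below at 0 so every score is non-negative.
    # No upper clip is needed because centered values never exceed 0,
    # so scores are bounded by `offset`.
    scores = [max(offset + slope * c, 0.0) for c in centered]
    # The maximum element always scores exactly `offset`, so total > 0.
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical calibration values, one (slope, offset) pair per head.
per_head_params = [(0.5, 1.0), (0.25, 1.0)]

row = [2.0, 1.0, -3.0]
for head, (slope, offset) in enumerate(per_head_params):
    probs = clipped_linear_softmax(row, slope, offset)
    print(head, probs)
```

Because the map is non-decreasing in each logit, the output never reverses the ordering of the inputs, although logits far below the maximum are clipped to exactly zero. The only operations involved are a subtraction, a multiply-add, a clamp, and a single normalization, which is why a mapping of this kind is a candidate for integer MAC hardware.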
