As machine learning is deployed ever more widely and model sizes continue to grow, improving computational efficiency during model inference has become a key challenge. In many commonly used model architectures, including Transformers, a significant portion of the inference computation consists of exponential non-linearities such as Softmax. In this work, we develop QuAKE, a collection of novel operators that leverage properties of IEEE-754 floating-point representations to quickly approximate the exponential function without requiring specialized hardware, extra memory, or precomputation. We propose optimizations that enhance the efficiency of QuAKE in commonly used exponential non-linearities such as Softmax, GELU, and the Logistic function. Our benchmarks demonstrate substantial inference speedups of 10% to 35% on server CPUs and 5% to 45% on embedded and mobile-scale CPUs for a variety of model architectures and sizes. Evaluations of model performance on standard datasets and tasks from various domains show that QuAKE operators provide sizable speed benefits with little to no loss of performance on downstream tasks.
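To give a sense of the family of techniques the abstract refers to, the sketch below shows the classic Schraudolph-style exponential approximation, which exploits the fact that the exponent field of an IEEE-754 float encodes a power of two: writing a scaled-and-shifted copy of the input directly into the bit pattern of a `float32` yields an approximation of `exp(x)` with no lookup tables or precomputation. This is a generic illustration of the bit-manipulation idea, not QuAKE's exact operator, and the constants shown are the standard float32 adaptation of Schraudolph's method, not values from the paper.

```python
import numpy as np

def fast_exp(x):
    """Approximate exp(x) via IEEE-754 bit manipulation (Schraudolph-style).

    The integer i = a*x + b, reinterpreted as a float32 bit pattern, lands
    close to exp(x) because the exponent field of a float32 stores
    floor(log2(value)) + 127 in bits 23..30. Relative error is within a
    few percent over a moderate input range.
    """
    x = np.asarray(x, dtype=np.float32)
    # a = 2^23 / ln(2): scales x so a unit step in x shifts the exponent
    # field by log2(e) worth of magnitude.
    a = np.float32(12102203.0)
    # b = (127 << 23) minus a small correction constant that reduces
    # the average approximation error.
    b = np.int32(1064866805)
    i = (a * x).astype(np.int32) + b
    # Reinterpret the integer bit pattern as a float32 (no conversion).
    return i.view(np.float32)
```

Because the approximation is a handful of integer operations and a bit reinterpretation, it vectorizes trivially, which is what makes this class of tricks attractive inside Softmax- and GELU-heavy inference workloads.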