Increased focus on the computational efficiency of NLP systems has motivated the design of efficient model architectures and improvements to underlying hardware accelerators. However, the resulting increases in computational throughput and reductions in floating point operations have not directly translated to improvements in wall-clock inference latency. We demonstrate that these discrepancies can be largely attributed to bottlenecks introduced by deep learning frameworks. We denote this phenomenon as the \textit{framework tax}, and observe that the disparity is growing as hardware speed increases over time. In this work, we examine this phenomenon through a series of case studies analyzing the effects of model design decisions, framework paradigms, and hardware platforms on total model latency. Code is available at https://github.com/JaredFern/Framework-Tax.
翻译:随着对自然语言处理系统计算效率关注度的提升,高效模型架构设计及底层硬件加速器的改进得到了有力推动。然而,由此带来的计算吞吐量提升与浮点运算次数减少并未直接转化为实际推理延迟的改善。我们发现,这些差异主要源于深度学习框架引入的性能瓶颈,并将这一现象称为"框架税"。观察表明,随着硬件性能随时间加速提升,这种差异正在持续扩大。本文通过一系列案例研究,系统分析了模型设计决策、框架范式及硬件平台对总推理延迟的影响。相关代码已开源至 https://github.com/JaredFern/Framework-Tax。