We present the first unified study of the efficiency of self-attention-based Transformer variants across text, speech, and vision. Using several efficiency metrics (latency, throughput, and memory), we identify the input-length thresholds (tipping points) at which efficient Transformer variants become more efficient than vanilla models. To conduct this analysis for speech, we introduce L-HuBERT, a novel local-attention variant of a self-supervised speech model. We observe that these thresholds are (a) much higher than typical dataset sequence lengths and (b) dependent on the metric and modality, which shows that choosing the right model depends on modality, task type (long-form vs. typical context), and resource constraints (time vs. memory). By visualising the breakdown of computational costs across Transformer components, we also show that non-self-attention components incur significant computational costs. We release our profiling toolkit at https://github.com/ajd12342/profiling-transformers.
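The notion of a tipping point can be illustrated with a minimal sketch. It assumes one has already profiled a vanilla model and an efficient variant at the same set of input lengths. The function name `tipping_point`, its parameters, and all numbers below are hypothetical and chosen only for illustration; this is not the code of the released profiling toolkit, and the values are not results from the study.

```python
from typing import Optional, Sequence


def tipping_point(
    lengths: Sequence[int],
    vanilla: Sequence[float],
    efficient: Sequence[float],
    lower_is_better: bool = True,
) -> Optional[int]:
    """Return the smallest profiled length from which the efficient variant
    is strictly better than the vanilla model at every longer profiled length.

    Returns None if the efficient variant is not better at the longest length.
    """
    if not (len(lengths) == len(vanilla) == len(efficient)):
        raise ValueError("all inputs must have the same number of entries")
    if list(lengths) != sorted(lengths):
        raise ValueError("lengths must be sorted in increasing order")

    result = None
    # Walk from the longest input downwards and stop at the first length
    # where the efficient variant is no longer better.
    for n, v, e in zip(reversed(lengths), reversed(vanilla), reversed(efficient)):
        better = e < v if lower_is_better else e > v
        if not better:
            break
        result = n
    return result


# Made-up measurements, for illustration only.
lengths = [128, 256, 512, 1024, 2048, 4096]

latency_vanilla_ms = [5, 9, 18, 40, 95, 260]
latency_efficient_ms = [8, 13, 22, 38, 70, 140]

memory_vanilla_gb = [0.5, 0.8, 1.6, 3.9, 11.0, 36.0]
memory_efficient_gb = [0.6, 0.9, 1.5, 2.8, 5.4, 10.6]

throughput_vanilla = [200, 110, 55, 25, 10, 4]    # samples per second
throughput_efficient = [125, 77, 45, 26, 14, 7]

latency_tp = tipping_point(lengths, latency_vanilla_ms, latency_efficient_ms)
memory_tp = tipping_point(lengths, memory_vanilla_gb, memory_efficient_gb)
throughput_tp = tipping_point(
    lengths, throughput_vanilla, throughput_efficient, lower_is_better=False
)
```

With these made-up numbers the memory tipping point (512) falls below the latency and throughput tipping points (1024), which mirrors the observation that the threshold depends on the metric. The sketch only considers the profiled lengths; interpolating between them would be a further assumption.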