Since their introduction, Transformer architectures have emerged as the dominant architectures for both natural language processing and, more recently, computer vision applications. An intrinsic limitation of this family of "fully-attentive" architectures arises from the computation of the dot-product attention, whose memory consumption and number of operations both grow as $O(n^2)$, where $n$ is the input sequence length; this limits applications that require modeling very long sequences. Several approaches have been proposed in the literature to mitigate this issue, with varying degrees of success. Our idea takes inspiration from the world of lossy data compression (such as the JPEG algorithm) to derive an approximation of the attention module by leveraging the properties of the Discrete Cosine Transform. Extensive experiments show that our method takes up less memory for the same performance, while also drastically reducing inference time. This makes it particularly suitable for real-time contexts on embedded platforms. Moreover, we believe that the results of our research might serve as a starting point for a broader family of deep neural models with reduced memory footprint. The implementation will be made publicly available at https://github.com/cscribano/DCT-Former-Public
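To make the compression idea concrete, the following is a minimal sketch of how a truncated Discrete Cosine Transform along the sequence axis can shrink the attention score matrix from $n \times n$ to $\bar{n} \times \bar{n}$. It is an illustration under our own assumptions, not the implementation released in the repository above: the function names, the choice to keep the first `n_keep` low-frequency coefficients of the queries, keys and values, and the reconstruction via the transposed DCT basis are all hypothetical choices made for this example.

```python
import math


def dct_matrix(n, n_keep):
    """First n_keep rows of the orthonormal DCT-II basis for length-n signals."""
    rows = []
    for i in range(n_keep):
        scale = math.sqrt((1.0 if i == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * j + 1) * i / (2 * n))
                     for j in range(n)])
    return rows


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax_rows(a):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v):
    """Standard scaled dot-product attention; scores are len(q) x len(k)."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scores), v)


def dct_compressed_attention(q, k, v, n_keep):
    """Attention computed on n_keep DCT coefficients (assumed scheme).

    q, k, v are n x d lists of rows. Each is projected onto the first
    n_keep DCT basis vectors along the sequence axis, attention is
    evaluated on the shortened sequences, and the result is mapped back
    to length n with the transposed basis.
    """
    n = len(q)
    c = dct_matrix(n, n_keep)                 # n_keep x n
    q_c, k_c, v_c = matmul(c, q), matmul(c, k), matmul(c, v)
    out_c = attention(q_c, k_c, v_c)          # scores are n_keep x n_keep
    return matmul(transpose(c), out_c)        # back to n x d


if __name__ == "__main__":
    n, d, n_keep = 16, 4, 4
    x = [[math.sin(0.3 * i + j) for j in range(d)] for i in range(n)]
    y = dct_compressed_attention(x, x, x, n_keep)
    print(len(y), len(y[0]))
```

In this sketch the quadratic term depends on `n_keep` rather than on the sequence length, which is the source of the memory saving; the output is an approximation of full attention, since the softmax does not commute with the truncation.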