Natural language processing (NLP) took a major leap forward with the introduction of Transformers. ChatGPT is one of the most famous examples, changing perceptions of what AI can do even outside the research community. However, despite this impressive performance, the quadratic time and space complexity of Transformers with respect to sequence length poses a significant limitation for handling long sequences. While efficient Transformer architectures with linear complexity, such as Linformer and Performer, have emerged as promising solutions, theoretical understanding of them remains limited. In this paper, we introduce Sumformer, a novel and simple architecture capable of universally approximating equivariant sequence-to-sequence functions. We use Sumformer to give the first universal approximation results for Linformer and Performer. Moreover, we derive a new proof for Transformers, showing that just one attention layer is sufficient for universal approximation.
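To make the notion of an equivariant sequence-to-sequence function concrete, the following is a minimal sketch, not the construction used in the paper. It assumes a sum-based layer in which every token is updated from itself and a single shared sum over all tokens; the abstract does not spell out Sumformer's exact form, so the functions `phi`, `psi`, and `sum_layer` are hypothetical stand-ins for learned networks. The sketch checks numerically that permuting the input tokens permutes the output in the same way, and that the layer needs only two passes over the sequence, so its cost grows linearly with sequence length.

```python
import math
import random


def phi(x):
    # Hypothetical per-token feature map (a learned network in practice).
    return [math.tanh(v) for v in x]


def psi(x, summary):
    # Hypothetical per-token update: combines a token with the shared summary.
    return [xi + si for xi, si in zip(x, summary)]


def sum_layer(tokens):
    # Pass 1: accumulate one summary vector over all tokens, O(n).
    summary = [0.0] * len(tokens[0])
    for x in tokens:
        for j, v in enumerate(phi(x)):
            summary[j] += v
    # Pass 2: update each token independently using the same summary, O(n).
    return [psi(x, summary) for x in tokens]


def permute(seq, perm):
    # Reorder a sequence according to a list of indices.
    return [seq[i] for i in perm]


random.seed(0)
tokens = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
perm = [3, 0, 4, 1, 2]

# Equivariance: applying the layer then permuting should match
# permuting then applying the layer (up to floating-point rounding).
out_then_perm = permute(sum_layer(tokens), perm)
perm_then_out = sum_layer(permute(tokens, perm))

is_equivariant = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_a, row_b in zip(out_then_perm, perm_then_out)
    for a, b in zip(row_a, row_b)
)
print(is_equivariant)
```

The comparison uses a tolerance because the sum is accumulated in a different order after permutation, which can change the result in the last floating-point digits.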