Sequence-to-sequence models have become central to Artificial Intelligence, particularly following the introduction of the transformer architecture. While initially developed for Natural Language Processing, these models have demonstrated utility across domains, including Computer Vision. Such models require mechanisms to exchange information along the time dimension, typically using recurrent or self-attention layers. However, self-attention scales quadratically with sequence length, limiting its practicality for very long sequences. We introduce Poolformer, a sequence-to-sequence model that replaces self-attention with recurrent layers and incorporates pooling operations to reduce sequence length. Poolformer is defined recursively using SkipBlocks, which contain residual blocks, a down-pooling layer, a nested SkipBlock, an up-pooling layer, and additional residual blocks. We conduct extensive experiments to support our architectural choices. Our results show that pooling greatly accelerates training, improves perceptual metrics (FID and IS), and prevents overfitting. Our experiments also suggest that long-range dependencies are handled by deep layers, while shallow layers capture short-term features. Evaluated on raw audio, which naturally features long sequence lengths, Poolformer outperforms state-of-the-art models such as SaShiMi and Mamba. Future directions include applications to text and vision, as well as multi-modal scenarios, where a Poolformer-based LLM could effectively process dense representations of images and videos.
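To make the recursive SkipBlock structure concrete, the following is a minimal PyTorch sketch. The specific layer choices here are illustrative assumptions not stated in the abstract: GRU-based residual recurrent blocks, average pooling for down-sampling along time, nearest-neighbor repetition for up-sampling, an additive skip connection around the pooled path, and the block counts and pooling factor.

```python
# Minimal sketch of the recursive SkipBlock structure (assumptions: GRU
# recurrent blocks, average down-pooling, repeat-based up-pooling, additive
# skip; none of these specifics are confirmed by the paper's abstract).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualRecurrentBlock(nn.Module):
    """A recurrent layer wrapped in a residual connection (assumed form)."""

    def __init__(self, dim: int):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, dim)
        out, _ = self.rnn(self.norm(x))
        return x + out


class SkipBlock(nn.Module):
    """Recursively defined: residual blocks, down-pooling, a nested SkipBlock,
    up-pooling, additional residual blocks, with a skip around the pooled path."""

    def __init__(self, dim: int, depth: int, pool: int = 2, n_blocks: int = 2):
        super().__init__()
        self.pre = nn.Sequential(*[ResidualRecurrentBlock(dim) for _ in range(n_blocks)])
        self.post = nn.Sequential(*[ResidualRecurrentBlock(dim) for _ in range(n_blocks)])
        self.pool = pool
        # Base case: at depth 0 the nested SkipBlock degenerates to identity.
        self.inner = SkipBlock(dim, depth - 1, pool, n_blocks) if depth > 0 else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, dim)
        x = self.pre(x)
        if self.inner is not None:
            skip = x
            # Down-pool along time: average every `pool` consecutive steps.
            y = F.avg_pool1d(x.transpose(1, 2), self.pool).transpose(1, 2)
            y = self.inner(y)  # recurse on the shorter sequence
            # Up-pool along time: repeat each step `pool` times.
            y = y.repeat_interleave(self.pool, dim=1)
            x = skip + y  # assumed additive skip connection
        return self.post(x)


# Example usage (sequence length must be divisible by pool**depth):
# model = SkipBlock(dim=64, depth=3)
# out = model(torch.randn(2, 64, 64))  # (batch=2, time=64, dim=64)
```

Note the design this sketch illustrates: each level of recursion operates on a sequence shortened by the pooling factor, so deep (inner) levels see coarse, long-range structure while shallow (outer) levels see fine, short-term structure, consistent with the abstract's observation about deep versus shallow layers.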