Poolformer: Recurrent Networks with Pooling for Long-Sequence Modeling

Sequence-to-sequence models have become central in Artificial Intelligence, particularly following the introduction of the transformer architecture. While initially developed for Natural Language Processing, these models have demonstrated utility across domains, including Computer Vision. Such models require mechanisms to exchange information along the time dimension, typically using recurrent or self-attention layers. However, self-attention scales quadratically with sequence length, limiting its practicality for very long sequences. We introduce Poolformer, a sequence-to-sequence model that replaces self-attention with recurrent layers and incorporates pooling operations to reduce sequence length. Poolformer is defined recursively using SkipBlocks, which contain residual blocks, a down-pooling layer, a nested SkipBlock, an up-pooling layer, and additional residual blocks. We conduct extensive experiments to support our architectural choices. Our results show that pooling greatly accelerates training, improves perceptual metrics (FID and IS), and prevents overfitting. Our experiments also suggest that long-range dependencies are handled by deep layers, while shallow layers take care of short-term features. Evaluated on raw audio, which naturally features long sequence lengths, Poolformer outperforms state-of-the-art models such as SaShiMi and Mamba. Future directions include applications to text and vision, as well as multi-modal scenarios, where a Poolformer-based LLM could effectively process dense representations of images and videos.
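The recursive SkipBlock definition described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. Several details are assumptions made only for illustration: the recurrent layer is a simple exponential moving average, down-pooling is window averaging, up-pooling is element repetition, the innermost level stops pooling, and the up-pooled output is added back to the pre-pooling activations as a skip connection. All function names (`recurrent_layer`, `residual_block`, `down_pool`, `up_pool`, `skip_block`) are hypothetical, and the sketch ignores the causality handling that an autoregressive model would need.

```python
from typing import List

Sequence = List[float]


def recurrent_layer(x: Sequence, decay: float = 0.5) -> Sequence:
    """Toy linear recurrence h[t] = decay * h[t-1] + (1 - decay) * x[t] (assumed)."""
    out, h = [], 0.0
    for value in x:
        h = decay * h + (1.0 - decay) * value
        out.append(h)
    return out


def residual_block(x: Sequence) -> Sequence:
    """Residual wrapper: output = input + recurrent_layer(input)."""
    return [a + b for a, b in zip(x, recurrent_layer(x))]


def down_pool(x: Sequence, factor: int) -> Sequence:
    """Shorten the sequence by averaging non-overlapping windows (assumed pooling)."""
    if len(x) % factor != 0:
        raise ValueError("sequence length must be divisible by the pooling factor")
    return [sum(x[i:i + factor]) / factor for i in range(0, len(x), factor)]


def up_pool(x: Sequence, factor: int) -> Sequence:
    """Restore the original length by repeating each element `factor` times."""
    return [value for value in x for _ in range(factor)]


def skip_block(x: Sequence, depth: int, factor: int = 2) -> Sequence:
    """Residual block, down-pool, nested SkipBlock, up-pool, residual block."""
    h = residual_block(x)
    if depth > 0:
        pooled = down_pool(h, factor)                  # shorter sequence
        inner = skip_block(pooled, depth - 1, factor)  # nested SkipBlock
        restored = up_pool(inner, factor)              # back to len(h)
        h = [a + b for a, b in zip(h, restored)]       # assumed skip connection
    return residual_block(h)


if __name__ == "__main__":
    signal = [float(i) for i in range(8)]
    print(len(skip_block(signal, depth=2)))  # same length as the input
```

In this sketch each nesting level operates on a sequence `factor` times shorter than the level above it, so the deepest layers see the most compressed view of the input while the outermost residual blocks run at full resolution.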
