AudioRWKV: Efficient and Stable Bidirectional RWKV for Audio Pattern Recognition

Recently, Transformers (e.g., Audio Spectrogram Transformers, AST) and state-space models (e.g., Audio Mamba, AuM) have achieved remarkable progress in audio modeling. However, the O(L^2) computational complexity of the Transformer architecture hinders efficient long-sequence processing, while the Mamba architecture tends to become unstable when scaling parameters and data. To address these challenges, this paper proposes AudioRWKV (A-RWKV), a highly efficient and stable architecture for audio modeling. Specifically, we inherit the stable and efficient recurrent formulation of RWKV7 and replace its 1D token-shift operation with a 2D depthwise separable convolution to better capture local spectro-temporal patterns. Furthermore, we adapt the original causal WKV kernel into a bidirectional WKV kernel (Bi-WKV), enabling global context modeling over the entire audio sequence while maintaining linear computational complexity. Benefiting from the inherent stability of the RWKV7 foundation, A-RWKV scales seamlessly to larger model sizes. Experimental results demonstrate that, under the same linear-model regime, A-RWKV-S (22M) achieves performance parity with AuM-B (92M) while exhibiting more stable throughput than AST; for long-form audio (~5 minutes 28 seconds), WKV7 achieves up to a 13.3X speedup in processing.
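
As a rough illustration of how a bidirectional kernel can cover the whole sequence while staying linear in its length L, the sketch below combines one forward scan and one backward scan. It is a deliberately simplified scalar-decay recurrence, not the RWKV7 WKV kernel and not the paper's Bi-WKV implementation. The function names, the single scalar `decay`, and the toy inputs are all assumptions made for illustration.

```python
def bidirectional_decay_scan(keys, values, decay):
    """Compute o[t] = sum_j decay**|t - j| * keys[j] * values[j] in O(L).

    Hypothetical toy kernel: scalar keys/values and one shared decay factor.
    """
    n = len(keys)
    contrib = [k * v for k, v in zip(keys, values)]

    # Forward (causal) scan: each position accumulates decayed past context.
    fwd = [0.0] * n
    state = 0.0
    for t in range(n):
        state = decay * state + contrib[t]
        fwd[t] = state

    # Backward (anti-causal) scan: each position accumulates decayed future context.
    bwd = [0.0] * n
    state = 0.0
    for t in range(n - 1, -1, -1):
        state = decay * state + contrib[t]
        bwd[t] = state

    # Both scans include position t itself, so subtract it once.
    return [fwd[t] + bwd[t] - contrib[t] for t in range(n)]


def quadratic_reference(keys, values, decay):
    """Same quantity computed with the explicit O(L^2) double loop."""
    n = len(keys)
    return [
        sum(decay ** abs(t - j) * keys[j] * values[j] for j in range(n))
        for t in range(n)
    ]


if __name__ == "__main__":
    keys = [1.0, 0.5, 2.0, 1.5]
    values = [0.2, -1.0, 0.7, 0.3]
    fast = bidirectional_decay_scan(keys, values, 0.9)
    slow = quadratic_reference(keys, values, 0.9)
    print(all(abs(a - b) < 1e-9 for a, b in zip(fast, slow)))
```

The two scans each touch every position once, so the cost grows linearly with L, whereas the reference loop grows quadratically. This is the same trade-off the abstract describes when it contrasts the Transformer's O(L^2) cost with a linear-complexity bidirectional kernel.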

