Recently, Transformers (e.g., the Audio Spectrogram Transformer, AST) and state-space models (e.g., Audio Mamba, AuM) have achieved remarkable progress in audio modeling. However, the O(L^2) computational complexity of the Transformer architecture in the sequence length L hinders efficient long-sequence processing, while the Mamba architecture tends to become unstable when scaling parameters and data. To address these challenges, this paper proposes AudioRWKV (A-RWKV), an efficient and stable architecture for audio modeling. Specifically, we inherit the stable and efficient recurrent formulation of RWKV7 and replace its 1D token-shift operation with a 2D depthwise separable convolution to better capture local spectro-temporal patterns. Furthermore, we adapt the original causal WKV kernel into a bidirectional WKV kernel (Bi-WKV), enabling global context modeling over the entire audio sequence while maintaining linear computational complexity. Benefiting from the inherent stability of the RWKV7 foundation, A-RWKV scales seamlessly to larger model sizes. Experimental results demonstrate that, under the same linear-model regime, A-RWKV-S (22M) matches the performance of AuM-B (92M) while exhibiting more stable throughput than AST; for long-form audio (~5 minutes 28 seconds), WKV7 achieves up to a 13.3× processing speedup.
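The two architectural changes named above can be illustrated with a small sketch. It is not the A-RWKV implementation: the function names, the `[frequency][time][channel]` patch-grid layout, and the 3x3 kernel size are assumptions made for illustration, and the scalar decay recurrence is a simplified stand-in for the WKV7 kernel, which is considerably more involved. The sketch only shows (a) a depthwise step followed by a pointwise step mixing each token with its 2D neighbors rather than only its 1D predecessor, and (b) a forward pass plus a backward pass giving every position access to the whole sequence at O(L) cost.

```python
from typing import List

# Assumed layout: grid[freq_patch][time_patch][channel]
Grid = List[List[List[float]]]


def depthwise_conv3x3(grid: Grid, kernels: List[List[List[float]]]) -> Grid:
    """Depthwise step: one 3x3 filter per channel, zero padding, same output size.

    kernels[ch] is the 3x3 filter for channel ch (cross-correlation, as is
    conventional for neural-network "convolutions").
    """
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for ch in range(c):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:  # skip zero-padded cells
                            acc += kernels[ch][di + 1][dj + 1] * grid[ii][jj][ch]
                out[i][j][ch] = acc
    return out


def pointwise_mix(grid: Grid, weight: List[List[float]]) -> Grid:
    """Pointwise step: a 1x1 convolution mixing channels at each position.

    weight[o][k] maps input channel k to output channel o.
    """
    return [
        [[sum(w_row[k] * cell[k] for k in range(len(cell))) for w_row in weight]
         for cell in row]
        for row in grid
    ]


def bidirectional_decay_scan(values: List[float], decay: float) -> List[float]:
    """Toy bidirectional scan: out[t] = sum_j decay**|t - j| * values[j].

    Computed with one forward and one backward recurrence, so the cost is
    linear in the sequence length. This is a stand-in, not the WKV7 kernel.
    """
    n = len(values)
    fwd = [0.0] * n
    bwd = [0.0] * n
    state = 0.0
    for t in range(n):  # causal (left-to-right) pass
        state = decay * state + values[t]
        fwd[t] = state
    state = 0.0
    for t in range(n - 1, -1, -1):  # anti-causal (right-to-left) pass
        state = decay * state + values[t]
        bwd[t] = state
    # values[t] appears in both passes, so subtract it once.
    return [fwd[t] + bwd[t] - values[t] for t in range(n)]
```

For example, `bidirectional_decay_scan([1.0, 2.0, 3.0], 0.5)` weights each element by its distance from the query position in both directions, which is the sense in which a bidirectional kernel provides global context while each pass remains a linear-time recurrence.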