MLP structures have recently regained popularity, with MLP-Mixer as a prominent example. In computer vision, MLP-Mixer is noted for extracting information from both the channel and token perspectives, effectively fusing the two. More generally, Mixer represents a paradigm for information extraction whose essence is blending information from different perspectives, which is what "mixing" means in neural network architectures. Beyond channels and tokens, mixers can be tailored to other perspectives to better suit specific tasks. This study focuses on audio recognition and introduces a novel model, Audio Spectrogram Mixer with Roll-Time and Hermit FFT (ASM-RH), that incorporates insights from both the time and frequency domains. Experimental results show that ASM-RH is well suited to audio data and yields promising results across multiple classification tasks. The models and optimal weight files will be published.
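To make the channel-and-token mixing idea concrete, the following is a minimal NumPy sketch of a generic MLP-Mixer block. It is not the ASM-RH model: the Roll-Time and Hermit FFT components are not specified here and are not reproduced. The helper names (`mixer_block`, `init_mlp`), the layer sizes, the random initialization, and the layer norm without learnable scale and shift are all illustrative assumptions.

```python
import numpy as np


def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def layer_norm(x, eps=1e-6):
    # Normalize over the channel axis (simplification: no learnable scale/shift)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def mlp(x, w1, b1, w2, b2):
    # Two-layer perceptron applied along the last axis
    return gelu(x @ w1 + b1) @ w2 + b2


def mixer_block(x, token_params, channel_params):
    # x has shape (tokens, channels)
    # Token mixing: transpose so the MLP acts across tokens, then transpose back
    y = layer_norm(x)
    x = x + mlp(y.T, *token_params).T
    # Channel mixing: the MLP acts across channels, independently per token
    y = layer_norm(x)
    x = x + mlp(y, *channel_params)
    return x


def init_mlp(rng, dim, hidden):
    # Hypothetical initialization, chosen only for illustration
    return (
        rng.normal(0.0, 0.02, (dim, hidden)),
        np.zeros(hidden),
        rng.normal(0.0, 0.02, (hidden, dim)),
        np.zeros(dim),
    )


rng = np.random.default_rng(0)
tokens, channels = 16, 32  # assumed sizes, e.g. 16 spectrogram patches
x = rng.normal(size=(tokens, channels))
token_params = init_mlp(rng, tokens, 64)
channel_params = init_mlp(rng, channels, 128)
out = mixer_block(x, token_params, channel_params)
```

The two residual branches differ only in the axis the MLP acts on. A mixer tailored to another perspective, such as time or frequency, would presumably swap in a different operation along the corresponding axis.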