MLP architectures have recently regained popularity, with MLP-Mixer as a prominent example. In computer vision, MLP-Mixer extracts information along two axes, channels and tokens, and fuses the two. More generally, a Mixer is a paradigm for information extraction that blends information from different perspectives, which is the sense of "mixing" in its name. Beyond channels and tokens, mixers can be tailored to other perspectives to better suit the requirements of a specific task. This study focuses on audio recognition and introduces a novel model, Audio Spectrogram Mixer with Roll-Time and Hermit FFT (ASM-RH), that incorporates insights from both the time and frequency domains. Experimental results show that ASM-RH is well suited to audio data and yields promising results across multiple classification tasks.