Deep neural networks have been applied to audio spectrograms for respiratory sound classification. Existing models often treat the spectrogram as a synthetic image and overlook its physical characteristics. In this paper, a Multi-View Spectrogram Transformer (MVST) is proposed to embed different views of time-frequency characteristics into the vision transformer. Specifically, the proposed MVST splits the mel-spectrogram into patches of different sizes, with each patch size representing a different view of the acoustic elements of a respiratory sound. These patches, together with positional embeddings, are then fed into transformer encoders, which extract attentional information among patches through a self-attention mechanism. Finally, a gated fusion scheme is designed to automatically weigh the multi-view features so that the most informative view is emphasized in a given scenario. Experimental results on the ICBHI dataset demonstrate that the proposed MVST significantly outperforms state-of-the-art methods for classifying respiratory sounds.
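The multi-view patching and gated fusion described above can be illustrated with a minimal sketch. This is not the authors' implementation: the patch shapes, the mean-pooling stand-in for the transformer encoders, and the linear gate in `gated_fusion` are all assumptions made for illustration, and every function name here is hypothetical.

```python
import math


def split_into_patches(spec, patch_f, patch_t):
    """Split a 2-D spectrogram (rows = frequency bins, columns = time frames)
    into non-overlapping patch_f x patch_t patches, each flattened to a list."""
    n_f, n_t = len(spec), len(spec[0])
    patches = []
    for f0 in range(0, n_f - patch_f + 1, patch_f):
        for t0 in range(0, n_t - patch_t + 1, patch_t):
            patch = [spec[f][t]
                     for f in range(f0, f0 + patch_f)
                     for t in range(t0, t0 + patch_t)]
            patches.append(patch)
    return patches


def mean_pool(patches):
    """Assumed stand-in for a transformer encoder: average the patch vectors."""
    dim = len(patches[0])
    return [sum(p[i] for p in patches) / len(patches) for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(view_features, gate_weights):
    """Fuse per-view feature vectors with softmax gates.

    Assumption: each view's gate score is a dot product between that view's
    feature vector and a weight vector (a hypothetical linear gate).
    """
    scores = [sum(w * x for w, x in zip(gw, feat))
              for gw, feat in zip(gate_weights, view_features)]
    gates = softmax(scores)
    dim = len(view_features[0])
    fused = [sum(g * feat[i] for g, feat in zip(gates, view_features))
             for i in range(dim)]
    return fused, gates


# Toy 8 x 8 "mel-spectrogram" and three assumed views:
# wide in time (2 x 8), tall in frequency (8 x 2), and square (4 x 4).
toy_spec = [[float(f * 8 + t) for t in range(8)] for f in range(8)]
view_shapes = [(2, 8), (8, 2), (4, 4)]
features = [mean_pool(split_into_patches(toy_spec, pf, pt))
            for pf, pt in view_shapes]

# Illustrative gate weights; a trained model would learn these.
weights = [[0.001] * 16, [0.002] * 16, [0.003] * 16]
fused_feature, view_gates = gated_fusion(features, weights)
```

In this sketch every view yields a 16-dimensional feature, so the gates can mix the views element-wise. A real model would project each view to a shared embedding size and would learn the gate parameters jointly with the encoders.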