Existing audio analysis methods generally first transform the audio stream into a spectrogram and then feed it into a CNN for further analysis. A standard CNN recognizes specific visual patterns over a feature map and then pools them into a high-level representation, which discards the positional information of the recognized patterns. However, unlike those of a natural image, the semantics of an audio spectrogram are sensitive to positional change, because its vertical and horizontal axes encode the frequency and temporal information of the audio rather than plain rectangular coordinates. Thus, the insensitivity of a CNN to positional change is detrimental to audio spectrogram encoding. To address this issue, this paper proposes a new self-supervised learning mechanism that enhances the audio representation by first generating adversarial samples (\textit{i.e.}, negative samples) and then driving the CNN to distinguish the embeddings of negative pairs in the latent space. Extensive experiments show that the proposed approach achieves the best or competitive results on 9 downstream datasets compared with previous methods, which verifies its effectiveness for audio representation learning.
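The positional insensitivity described above can be illustrated with a toy example. The following is a minimal sketch and not the paper's implementation: the abstract does not say how negative samples are generated, so the frequency shift in `shift_frequency`, the fixed filters in `KERNELS`, and the margin-based `negative_pair_loss` are all assumptions chosen for illustration. It shows that a convolution followed by global max pooling maps a spectrogram and its frequency-shifted copy to the same embedding, and that a loss on the negative pair is then non-zero, which is the signal that would push the two embeddings apart during training.

```python
# Toy illustration only; every name here is hypothetical, not from the paper.

KERNELS = [
    [[1.0, 1.0], [1.0, 1.0]],    # responds to a 2x2 energy blob
    [[1.0, -1.0], [1.0, -1.0]],  # responds to a falling edge along time
]


def cross_correlate(image, kernel):
    """2-D cross-correlation without padding (what a CNN layer computes)."""
    kh, kw = len(kernel), len(kernel[0])
    rows = len(image) - kh + 1
    cols = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def embed(spec):
    """Convolve with each filter, then global-max-pool to one value per filter."""
    return [max(max(row) for row in cross_correlate(spec, k)) for k in KERNELS]


def shift_frequency(spec, bins):
    """Assumed negative-sample generator: move content up by `bins` frequency
    rows, zero-filling the vacated rows (rows = frequency, columns = time)."""
    n_freq, n_time = len(spec), len(spec[0])
    blank = [[0.0] * n_time for _ in range(bins)]
    return blank + [row[:] for row in spec[:n_freq - bins]]


def negative_pair_loss(z_anchor, z_negative, margin=1.0):
    """Hinge-style loss: positive while the negative pair is closer than `margin`."""
    distance = sum((a - b) ** 2 for a, b in zip(z_anchor, z_negative)) ** 0.5
    return max(0.0, margin - distance)


# An 8x8 "spectrogram" with a single 2x2 blob of energy.
spec = [[0.0] * 8 for _ in range(8)]
for r in (1, 2):
    for c in (3, 4):
        spec[r][c] = 1.0

negative = shift_frequency(spec, 3)  # same pattern, different frequency band
z_anchor = embed(spec)
z_negative = embed(negative)
loss = negative_pair_loss(z_anchor, z_negative)

print(z_anchor == z_negative)  # the pooled embeddings cannot tell the two apart
print(loss)                    # so the negative-pair loss is at its maximum
```

In a trainable model the filters would be learned, and minimizing a loss of this kind over generated negatives would force the encoder to retain where on the frequency and time axes a pattern occurs.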