Spectrogram-based representations have come to dominate the feature space for deep learning audio analysis systems, and are often adopted for speech analysis as well. Initially, the primary motivation for spectrogram-based representations was their ability to present sound as a two-dimensional signal in the time-frequency plane, which not only provides an interpretable physical basis for analysing sound, but also unlocks a wide range of machine learning techniques, such as convolutional neural networks, that had been developed for image processing. A spectrogram is a matrix characterised by the resolution and span of its two dimensions, as well as by the representation and scaling of each element. Researchers across numerous application areas have explored many possibilities for these three characteristics, with different settings proving better suited to different tasks. This paper reviews the use of spectrogram-based representations and surveys the state of the art to examine how the choice of front-end feature representation aligns with back-end classifier architecture for different tasks.
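To make the three characteristics concrete, the following is a minimal sketch of a log-power spectrogram computed with NumPy alone. It is an illustration under stated assumptions, not the method of any system surveyed here: the function name `log_spectrogram`, the parameters `n_fft`, `hop` and `eps`, the Hann window, and the decibel scaling are all choices made for this example. Here `hop` sets the time resolution, `n_fft` sets the frequency resolution, the signal length and sample rate set the span of each axis, and the power-in-decibels step is one possible element representation and scaling.

```python
import numpy as np


def log_spectrogram(signal, sample_rate, n_fft=512, hop=128, eps=1e-10):
    """Return (S_db, times, freqs) for a 1-D signal.

    S_db has shape (n_fft // 2 + 1, n_frames): rows are frequency bins,
    columns are time frames. All parameter names are illustrative.
    """
    signal = np.asarray(signal, dtype=float)
    if signal.ndim != 1 or signal.size < n_fft:
        raise ValueError("signal must be 1-D and at least n_fft samples long")

    # Time axis: one frame every `hop` samples (no padding at the edges).
    n_frames = 1 + (signal.size - n_fft) // hop
    index = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[index] * np.hanning(n_fft)

    # Frequency axis: bins spaced sample_rate / n_fft apart, up to Nyquist.
    spectrum = np.fft.rfft(frames, axis=1)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)

    # Element representation and scaling: power, then decibels.
    # `eps` avoids taking the log of zero in silent frames.
    power = np.abs(spectrum) ** 2
    s_db = 10.0 * np.log10(power + eps)

    # Frame centres in seconds.
    times = (np.arange(n_frames) * hop + n_fft / 2) / sample_rate
    return s_db.T, times, freqs


# Usage: one second of a 1 kHz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)
S_db, times, freqs = log_spectrogram(tone, sr)
# S_db.shape is (257, 122); adjacent frequency bins are 31.25 Hz apart.
```

Halving `hop` doubles the number of columns without changing the frequency axis, whereas doubling `n_fft` halves the bin spacing at the cost of coarser time localisation. A mel or other perceptual frequency warping, or a different compression such as a root or a per-channel normalisation, would replace the last two steps and is deliberately left out of this sketch.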