Acoustic scene classification has been the subject of recent work in the audio signal processing community. In contrast, few studies have addressed acoustic scene clustering, a newly emerging problem. Acoustic scene clustering aims to merge audio recordings of the same class of acoustic scene into a single cluster without using prior information or training classifiers. In this study, we propose a method for acoustic scene clustering that jointly optimizes feature learning and clustering iteration. In the proposed method, the learned feature is a deep embedding extracted from a deep convolutional neural network (CNN), and the clustering algorithm is agglomerative hierarchical clustering (AHC). We formulate a unified loss function that integrates and optimizes these two procedures. We compare various features and methods. The experimental results demonstrate that the proposed method outperforms other unsupervised methods in terms of normalized mutual information and clustering accuracy. In addition, the deep embedding outperforms many state-of-the-art features.
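To make the two building blocks named in the abstract concrete, the following is a minimal sketch of agglomerative hierarchical clustering followed by a normalized mutual information (NMI) score. It is not the proposed joint optimization: the unified loss and the CNN are omitted, and the two-dimensional `embeddings` are hypothetical stand-ins for deep embeddings. Average linkage, Euclidean distance, and geometric-mean normalization of NMI are assumptions chosen for illustration, as are all function and variable names.

```python
import math
from collections import Counter


def ahc(points, n_clusters):
    """Average-linkage agglomerative clustering (assumed linkage, for illustration)."""
    clusters = [[i] for i in range(len(points))]  # start from singleton clusters

    def linkage(a, b):
        # Mean Euclidean distance over all cross-cluster pairs.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters (p < q) and merge q into p.
        _, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        merged = clusters.pop(q)
        clusters[p].extend(merged)

    labels = [0] * len(points)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def nmi(truth, pred):
    """Mutual information divided by the geometric mean of the entropies (assumed)."""
    n = len(truth)
    ct, cp = Counter(truth), Counter(pred)
    mi = sum(
        c / n * math.log(c * n / (ct[t] * cp[p]))
        for (t, p), c in Counter(zip(truth, pred)).items()
    )
    h_t, h_p = entropy(truth), entropy(pred)
    if h_t == 0 or h_p == 0:
        # Degenerate case: at least one labeling has a single cluster.
        return 1.0 if h_t == h_p else 0.0
    return mi / math.sqrt(h_t * h_p)


# Hypothetical 2-D "embeddings" for six recordings from two scene classes.
embeddings = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
truth = ["park", "park", "park", "bus", "bus", "bus"]

labels = ahc(embeddings, n_clusters=2)
score = nmi(truth, labels)
print(labels, score)
```

Because NMI depends only on how the recordings are partitioned, it is unaffected by which integer each cluster receives. Clustering accuracy, by contrast, needs an explicit mapping between cluster indices and class labels before it can be computed.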