We present PECMAE, an interpretable model for music audio classification based on prototype learning. Our model builds on a previous method, APNet, which jointly learns an autoencoder and a prototypical network. We instead propose to decouple the two training processes. This lets us leverage an existing self-supervised autoencoder pre-trained on much larger data (EnCodecMAE), whose representations generalize better. For interpretability, APNet reconstructs prototypes to waveforms, but it relies on the nearest training data samples to do so. In contrast, we explore a diffusion decoder that allows reconstruction without such a dependency. We evaluate our method on datasets for music instrument classification (Medley-Solos-DB) and genre recognition (GTZAN and a larger in-house dataset), the latter being a more challenging task that has not previously been addressed with prototypical networks. We find that the prototype-based models preserve most of the performance achieved with the autoencoder embeddings, while the sonification of prototypes helps in understanding the behavior of the classifier.
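To illustrate the general idea of prototype-based classification over fixed embeddings, the following is a minimal sketch and not the PECMAE implementation. It assumes embeddings are plain vectors that would, in practice, come from a frozen pre-trained encoder, and it uses a simple nearest-prototype rule with an exp(-distance) similarity. The function names, the toy prototypes, and the labels are all hypothetical. A trained prototypical network would instead learn the prototypes and typically a classification layer on top of the similarities.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prototype_similarities(embedding, prototypes):
    """Map squared distances to similarities in (0, 1] via exp(-d)."""
    return [math.exp(-squared_distance(embedding, p)) for p in prototypes]


def classify(embedding, prototypes, prototype_labels):
    """Return the label of the most similar prototype (assumed decision rule)."""
    sims = prototype_similarities(embedding, prototypes)
    best = max(range(len(sims)), key=sims.__getitem__)
    return prototype_labels[best]


# Toy 2-D vectors standing in for encoder embeddings (hypothetical values).
prototypes = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
prototype_labels = ["flute", "flute", "piano"]

print(classify([0.9, 1.2], prototypes, prototype_labels))
```

Because each prediction traces back to a specific prototype, decoding that prototype to audio is what makes the decision inspectable by listening.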