The increasing success of deep neural networks has raised concerns about their inherent black-box nature, which poses challenges for interpretability and trust. While interpretation techniques have been explored extensively in vision and language, interpretability in the audio domain has received limited attention, and existing work focuses primarily on post-hoc explanations. This paper addresses interpretability by design in the audio domain using the recently proposed attention-free focal modulation networks (FocalNets). We apply FocalNets to environmental sound classification for the first time and evaluate their interpretability properties on the popular ESC-50 dataset. Our method outperforms a similarly sized vision transformer in terms of both accuracy and interpretability. Furthermore, it is competitive with PIQ, a method specifically designed for post-hoc interpretation in the audio domain.