In environmental sound classification, adaptability is key: which sound classes are of interest depends on the context and the user's needs. Recent advances in text-to-audio retrieval enable zero-shot audio classification, but its performance still lags behind that of supervised models. This work proposes a multimodal prototypical approach that exploits local audio-text embeddings to provide more relevant answers to audio queries, improving the adaptability of sound detection in the wild. First, we use text to query a nearby community of audio embeddings that best characterizes each query sound, and select each group's centroid as a prototype. Second, we classify unseen audio by comparing it to these prototypes. We perform multiple ablation studies to understand the impact of the embedding models and prompts. Our unsupervised approach improves upon the zero-shot state of the art on three sound recognition benchmarks by an average of 12%.
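The two steps described above can be illustrated with a minimal sketch. It assumes that text and audio embeddings already live in a shared space (for example, from a contrastive audio-text model), that the "nearby community" is taken to be the k audio embeddings most similar to the text query, and that comparison uses cosine similarity. The abstract specifies none of these details, so the function names, the value of `k`, and the synthetic data are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def build_prototypes(text_emb, audio_pool, k=5):
    """Step 1 (assumed form): for each class prompt, take the k most similar
    audio embeddings from an unlabeled pool and use their centroid as prototype.

    text_emb:   (C, D) one text embedding per class prompt
    audio_pool: (N, D) unlabeled audio embeddings
    returns:    (C, D) unit-length audio prototypes
    """
    text_emb = l2_normalize(text_emb)
    audio_pool = l2_normalize(audio_pool)
    sims = text_emb @ audio_pool.T                # (C, N) cosine similarities
    top_k = np.argsort(-sims, axis=1)[:, :k]      # (C, k) nearest audio indices
    centroids = audio_pool[top_k].mean(axis=1)    # (C, D) mean of each group
    return l2_normalize(centroids)


def classify(query_audio, prototypes):
    """Step 2: assign each query to the prototype with highest cosine similarity."""
    query_audio = l2_normalize(query_audio)
    return np.argmax(query_audio @ prototypes.T, axis=1)


# Synthetic demonstration only: two classes clustered around two basis directions.
rng = np.random.default_rng(0)
dim = 8
directions = np.eye(dim)[:2]                      # class 0 -> e0, class 1 -> e1

# Unlabeled audio pool: 20 noisy samples around each class direction.
audio_pool = np.concatenate(
    [d + 0.05 * rng.standard_normal((20, dim)) for d in directions]
)

# Text embeddings: class direction plus a fixed offset, mimicking a gap
# between the text and audio modalities.
text_emb = directions + 0.3 * np.eye(dim)[2]

# Unseen audio queries: two per class.
queries = np.concatenate(
    [d + 0.05 * rng.standard_normal((2, dim)) for d in directions]
)

prototypes = build_prototypes(text_emb, audio_pool, k=5)
predictions = classify(queries, prototypes)
print(predictions)
```

Because each prototype is an average of audio embeddings rather than the text embedding itself, unseen audio is compared with points from its own modality. This is one plausible reading of why local audio-text embeddings could help, not a statement of the authors' exact procedure.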