With the advent of modern AI architectures, the field has shifted towards end-to-end models. As a result, neural architectures are typically trained without domain-specific biases or knowledge and are optimized directly for the task. In this paper, we learn audio embeddings from diverse, domain-specific feature representations. For audio classification over hundreds of sound categories, we learn separate, robust embeddings for distinct audio properties such as pitch, timbre, and neural representation, in addition to an embedding learned by an end-to-end architecture. We observe that handcrafted embeddings, e.g., pitch- and timbre-based ones, cannot on their own beat a fully end-to-end representation, yet combining them with the end-to-end embedding significantly improves performance. This work could pave the way for bringing domain expertise into end-to-end models to learn robust, diverse representations that surpass the performance of training end-to-end models alone.
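The idea of combining separately learned embeddings with an end-to-end embedding can be illustrated with a minimal sketch. The abstract does not specify how the embeddings are combined, so the code below assumes simple concatenation of L2-normalized vectors followed by a linear softmax classifier. The embedding names, dimensions (64, 64, 128), class count (200), and random values are hypothetical placeholders, not the paper's configuration or results.

```python
import math
import random


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length so no single embedding dominates the fusion."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def fuse(embeddings):
    """Concatenate the normalized embeddings in a fixed (sorted-by-name) order."""
    fused = []
    for name in sorted(embeddings):
        fused.extend(l2_normalize(embeddings[name]))
    return fused


def softmax_classify(fused, weights, bias):
    """Linear layer followed by a numerically stable softmax over classes."""
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)

# Hypothetical per-property embeddings for one audio clip (random stand-ins).
embeddings = {
    "pitch": [random.gauss(0, 1) for _ in range(64)],
    "timbre": [random.gauss(0, 1) for _ in range(64)],
    "end_to_end": [random.gauss(0, 1) for _ in range(128)],
}

fused = fuse(embeddings)

# Untrained classifier weights; in practice these would be learned.
num_classes = 200
weights = [[random.gauss(0, 0.01) for _ in range(len(fused))]
           for _ in range(num_classes)]
bias = [0.0] * num_classes

probs = softmax_classify(fused, weights, bias)
print(len(fused), len(probs))
```

Normalizing each embedding before concatenation is one design choice among several; weighted sums or learned fusion layers are equally plausible alternatives.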