Sound events in daily life carry rich information about the objective world, and the composition of these sounds affects the mood of people in a soundscape. Most previous approaches focus only on classifying and detecting audio events and scenes, and may overlook the perceptual qualities of those sounds, e.g. annoyance, that shape how listeners feel about the environment. To this end, this paper proposes a novel hierarchical graph representation learning (HGRL) approach which links objective audio events (AE) with subjective annoyance ratings (AR) of the soundscape perceived by humans. The hierarchical graph consists of fine-grained event (fAE) embeddings with single-class event semantics, coarse-grained event (cAE) embeddings with multi-class event semantics, and AR embeddings. Experiments show that the proposed HGRL successfully integrates AE with AR for the audio event classification (AEC) and annoyance rating prediction (ARP) tasks, while coordinating the relations between cAE and fAE and further aligning the two grains of AE information with the AR.
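As a rough illustration of the three-level hierarchy described above (fAE, cAE, AR), the sketch below aggregates fine-grained embeddings into coarse-grained ones and then into a single AR embedding. It is not the HGRL model: mean pooling stands in for learned graph message passing, and the event names, the `groups` mapping, the embedding size `DIM`, and the helpers `mean_vectors` and `build_hierarchy` are all hypothetical choices made only for this example.

```python
import random


def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def build_hierarchy(fae, groups):
    """Aggregate embeddings upward: fAE -> cAE -> AR.

    fae:    dict mapping a fine-grained event name to its embedding
    groups: dict mapping a coarse-grained event name to the fine-grained
            events it covers (hypothetical grouping, not the paper's taxonomy)
    """
    # Each cAE node pools the fAE nodes it covers (multi-class semantics).
    cae = {c: mean_vectors([fae[f] for f in members])
           for c, members in groups.items()}
    # The AR node pools all cAE nodes (assumption: plain mean pooling).
    ar = mean_vectors(list(cae.values()))
    return cae, ar


random.seed(0)
DIM = 4  # hypothetical embedding size
fae = {name: [random.uniform(-1, 1) for _ in range(DIM)]
       for name in ["car", "bus", "bird", "speech"]}
groups = {"traffic": ["car", "bus"], "nature": ["bird"], "human": ["speech"]}

cae, ar = build_hierarchy(fae, groups)
print(sorted(cae), len(ar))
```

In a learned model, the pooling steps would be replaced by trainable graph layers, with an AEC head reading the event embeddings and an ARP head reading the AR embedding; the sketch only shows how the three levels relate to one another.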