Most deep learning-based acoustic scene classification (ASC) approaches identify scenes from acoustic features extracted from audio clips, in which information from polyphonic audio events (AEs) is mixed and entangled. However, these approaches have difficulty explaining which cues they use to identify scenes. This paper presents the first study of the relationship between real-life acoustic scenes and semantic embeddings of the most relevant AEs. Specifically, we propose an event-relational graph representation learning (ERGL) framework for ASC that classifies scenes and, at the same time, directly indicates which cues are used for classification. In the event-relational graph, the embedding of each event is treated as a node, while the relationship cues derived from each pair of nodes are described by multi-dimensional edge features. Experiments on a real-life ASC dataset show that the proposed ERGL achieves competitive ASC performance by learning embeddings of only a limited number of AEs. The results demonstrate the feasibility of recognizing diverse acoustic scenes based on the audio event-relational graph. Visualizations of the graph representations learned by ERGL are available at https://github.com/Yuanbo2020/ERGL.
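The graph structure described in the abstract (event embeddings as nodes, multi-dimensional edge features derived from each pair of nodes) can be illustrated with a minimal NumPy sketch. This is not the authors' ERGL implementation: the function names `build_event_graph` and `classify_scene`, the single linear layer with `tanh` used as the edge function, the mean-pooling readout, and all dimensions are assumptions chosen for illustration, and random vectors stand in for learned event embeddings.

```python
import numpy as np


def build_event_graph(event_embeddings, edge_weight, edge_bias):
    """Build a fully connected graph with vector-valued edges.

    Hypothetical edge function (an assumption, not taken from the paper):
    one linear layer plus tanh applied to each concatenated node pair.
    """
    num_events = event_embeddings.shape[0]
    # src[i, j] is the embedding of node i; dst[i, j] is the embedding of node j.
    src = np.repeat(event_embeddings[:, None, :], num_events, axis=1)
    dst = np.repeat(event_embeddings[None, :, :], num_events, axis=0)
    pairs = np.concatenate([src, dst], axis=-1)        # shape (N, N, 2 * D)
    edges = np.tanh(pairs @ edge_weight + edge_bias)   # shape (N, N, E)
    return event_embeddings, edges


def classify_scene(nodes, edges, readout_weight):
    """Assumed readout: mean-pool nodes and edges, then linear layer + softmax."""
    graph_vector = np.concatenate([nodes.mean(axis=0), edges.mean(axis=(0, 1))])
    logits = graph_vector @ readout_weight
    shifted = np.exp(logits - logits.max())            # numerically stable softmax
    return shifted / shifted.sum()


# Illustrative sizes only: 5 audio events, 8-dim embeddings,
# 4-dim edge features, 3 scene classes.
rng = np.random.default_rng(0)
num_events, embed_dim, edge_dim, num_scenes = 5, 8, 4, 3

embeddings = rng.normal(size=(num_events, embed_dim))  # stand-in for learned AE embeddings
edge_weight = rng.normal(size=(2 * embed_dim, edge_dim))
edge_bias = np.zeros(edge_dim)
readout_weight = rng.normal(size=(embed_dim + edge_dim, num_scenes))

nodes, edges = build_event_graph(embeddings, edge_weight, edge_bias)
scene_probs = classify_scene(nodes, edges, readout_weight)
```

Because each edge is a vector rather than a scalar weight, `edges[i, j]` can carry several distinct relationship cues between event `i` and event `j`, which is the property the abstract attributes to multi-dimensional edge features.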