Sound event localization and detection (SELD) is the task of classifying sound events and estimating their direction of arrival (DoA) from multichannel acoustic signals. Prior studies use spectral and channel information as the embedding for temporal attention. This usage, however, limits the ability of a deep neural network to extract meaningful features from the spectral or spatial domains. We therefore present the Channel-Spectro-Temporal Transformer (CST-former), a novel framework that improves SELD performance by applying separate attention mechanisms to the channel, spectral, and temporal domains independently. In addition, we propose an unfolded local embedding (ULE) technique for channel attention (CA) that generates informative embedding vectors containing local spectral and temporal information. Experiments on the 2022 and 2023 DCASE Challenge Task 3 datasets confirm that separating attention across domains and adding ULE both improve SELD performance.
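The idea of attending over each domain separately can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a feature tensor of shape (channels, time, frequency), omits learned query/key/value projections, residual connections, and normalization, and picks an arbitrary order for the three attention steps. The helper names `unfold_local`, `fold_local`, and `cst_block`, as well as the patch sizes `pt` and `pf`, are hypothetical and only approximate what an unfolded local embedding might look like.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    """Parameter-free scaled dot-product self-attention.

    tokens has shape (..., sequence, embedding). Learned projections are
    omitted (assumption) so that only the choice of axes is visible.
    """
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens


def unfold_local(x, pt, pf):
    """Hypothetical ULE-style unfolding: (C, T, F) -> (patches, C, pt * pf).

    Each channel token receives a flattened local time-frequency patch
    as its embedding vector.
    """
    C, T, F = x.shape
    assert T % pt == 0 and F % pf == 0, "patch sizes must divide T and F"
    x = x.reshape(C, T // pt, pt, F // pf, pf)
    x = x.transpose(1, 3, 0, 2, 4)            # (T/pt, F/pf, C, pt, pf)
    return x.reshape(-1, C, pt * pf)


def fold_local(p, shape, pt, pf):
    """Inverse of unfold_local: (patches, C, pt * pf) -> (C, T, F)."""
    C, T, F = shape
    p = p.reshape(T // pt, F // pf, C, pt, pf)
    p = p.transpose(2, 0, 3, 1, 4)            # (C, T/pt, pt, F/pf, pf)
    return p.reshape(C, T, F)


def cst_block(x, pt=2, pf=2):
    """Apply channel, spectral, and temporal attention one after another."""
    shape = x.shape
    # Channel attention: sequence = channels, embedding = local T-F patch.
    x = fold_local(self_attention(unfold_local(x, pt, pf)), shape, pt, pf)
    # Spectral attention: sequence = frequency bins, embedding = channels.
    x = self_attention(x.transpose(1, 2, 0)).transpose(2, 0, 1)
    # Temporal attention: sequence = time frames, embedding = channels.
    x = self_attention(x.transpose(2, 1, 0)).transpose(2, 1, 0)
    return x


if __name__ == "__main__":
    features = np.random.default_rng(0).standard_normal((4, 6, 8))  # (C, T, F)
    print(cst_block(features).shape)
```

In each step only one axis plays the role of the attention sequence, which is the contrast with using spectral and channel information purely as the embedding for temporal attention.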