The capabilities of language models (LMs) in cross-modal visual tasks have recently attracted increasing attention. In this paper, we extend this line of work beyond the visual domain and explore the generative capacity of LMs for sound event detection (SED). Specifically, we propose an elegant method that aligns audio features with text features to accomplish sound event classification and temporal localization. The framework consists of an acoustic encoder, a contrastive module that aligns the corresponding text and audio representations, and a decoupled language decoder that generates temporal and event sequences from the audio features. Compared with conventional approaches, which require complicated processing and exploit only limited audio features, our model is more concise and comprehensive because the language model directly leverages its semantic capabilities to generate the sequences. We investigate different decoupling modules to demonstrate their effectiveness for timestamp capture and event classification. Evaluation results show that the proposed method generates accurate sound event detection sequences.
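To make the role of the contrastive module concrete, the following is a minimal sketch of one common way to align paired audio and text representations: a symmetric, temperature-scaled contrastive (InfoNCE-style) loss over a batch. The abstract does not specify the loss, so the formulation, the function names, and the temperature value of 0.07 are all assumptions made for illustration and should not be read as the method actually used in the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Hypothetical alignment loss: audio_emb[i] and text_emb[i] form a
    matched pair; every other combination in the batch is a negative.
    The temperature value is an assumption, not taken from the paper."""
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(a)
    # Cosine similarity between every audio and every text embedding.
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Audio-to-text direction: each row should peak on its diagonal entry.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    t2a = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    matched_text = [[1.0, 0.0], [0.0, 1.0]]
    swapped_text = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs:", symmetric_contrastive_loss(audio, matched_text))
    print("swapped pairs:", symmetric_contrastive_loss(audio, swapped_text))
```

In this toy setup the loss is close to zero when each audio embedding points in the same direction as its paired text embedding, and it grows large when the pairs are swapped. That is the behaviour an alignment objective of this kind is meant to encourage.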