The task of speaker change detection (SCD), which detects points where speakers change in an input, is essential for several applications. Several studies have addressed the SCD task using audio inputs only, with limited performance. Recently, multimodal SCD (MMSCD) models, which utilise the text modality in addition to audio, have shown improved performance. In this study, the proposed model is built upon two main proposals: a novel mechanism for modality fusion and the adoption of an encoder-decoder architecture. Unlike previous MMSCD works that extract speaker embeddings from extremely short audio segments, each aligned to a single word, we use a speaker embedding extracted from a 1.5 s segment. A transformer decoder layer further improves the performance of an encoder-only MMSCD model. The proposed model achieves state-of-the-art results among studies that report SCD performance and is also on par with recent work that combines SCD with automatic speech recognition via human transcription.