Conformer and Mamba have achieved strong performance in speech modeling, but both face limitations in speaker diarization. Mamba is efficient but struggles to capture local details and nonlinear patterns. Conformer's self-attention incurs high memory overhead on long speech sequences and may be unstable when modeling long-range dependencies. These limitations are critical for diarization, which requires both precise modeling of local variations and robust speaker consistency over extended spans. To address these challenges, we are the first to apply ConBiMamba to speaker diarization. Following the Pyannote pipeline, we propose the Dual-Strategy-Enhanced ConBiMamba neural speaker diarization system. ConBiMamba integrates the strengths of Conformer and Mamba: it retains Conformer's convolutional and feed-forward structures to improve local feature extraction, and it replaces Conformer's self-attention with ExtBiMamba to handle long audio sequences efficiently while alleviating the high memory cost of self-attention. Furthermore, to reduce the higher diarization error rate (DER) observed around speaker change points, we introduce the Boundary-Enhanced Transition Loss, which strengthens the detection of speaker change points. We also propose Layer-wise Feature Aggregation to make better use of multi-layer representations. The system is evaluated on six diarization datasets and achieves state-of-the-art performance on four of them. The source code is available at https://github.com/lz-hust/DSE-CBM.
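To illustrate the general idea behind Layer-wise Feature Aggregation, the following is a minimal sketch that assumes the aggregation is a softmax-weighted sum of the per-layer outputs. The abstract does not specify the actual formulation, so this should be read as one common way to combine multi-layer representations rather than the method used in the paper. The names `aggregate_layers` and `layer_scores` are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_outputs, layer_scores):
    """Fuse per-layer features with softmax-normalized layer weights.

    layer_outputs: list of L layers, each a list of T frames,
                   each frame a list of D floats.
    layer_scores:  list of L unnormalized scores (hypothetical; in a
                   trained model these would be learnable parameters).
    Returns a list of T frames, each a list of D floats.
    """
    if len(layer_outputs) != len(layer_scores):
        raise ValueError("need exactly one score per layer")

    weights = softmax(layer_scores)
    num_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])

    fused = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, layer_outputs):
        for t in range(num_frames):
            for d in range(dim):
                fused[t][d] += weight * layer[t][d]
    return fused


# Two layers, one frame, two feature dimensions (toy values).
toy_layers = [[[1.0, 2.0]], [[3.0, 4.0]]]
fused = aggregate_layers(toy_layers, [0.0, 0.0])  # equal scores: plain average
```

With equal scores the weights are uniform and the result is the plain average of the layers. Unequal scores shift the fused representation toward the higher-scoring layers.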