Target speaker extraction (TSE) aims to extract the target speaker's voice from an input mixture. Previous studies have concentrated on highly overlapped scenarios. Real-world applications, however, often involve more complex conditions, such as variable speaker overlap and the absence of the target speaker. In this paper, we introduce a framework for continuous TSE (C-TSE) that comprises a target speaker voice activity detection (TSVAD) model and a TSE model. The framework significantly improves TSE performance on similar speakers and adds personalization, which traditional diarization methods lack. Specifically, unlike conventional TSVAD, which is deployed to refine diarization results, the proposed attention-based target speaker voice activity detection (A-TSVAD) directly generates timestamps for the target speaker. We also explore different ways of integrating A-TSVAD and TSE by comparing cascaded and parallel methods. The framework's effectiveness is assessed with both diarization and enhancement metrics. Our experiments demonstrate that A-TSVAD outperforms conventional methods in reducing diarization errors, and that integrating A-TSVAD and TSE in a cascaded manner further improves extraction accuracy.
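The cascaded integration described above can be illustrated with a minimal sketch. It assumes the detector emits one target-activity probability per frame and that the TSE model has already produced a waveform; neither detail is specified in the abstract. The function names `frames_to_segments` and `gate_with_segments`, the threshold, and the frame shift are all hypothetical choices made for illustration and are not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def frames_to_segments(
    probs: Sequence[float],
    threshold: float = 0.5,   # assumed decision threshold
    frame_shift: float = 0.01,  # assumed frame shift in seconds
) -> List[Tuple[float, float]]:
    """Convert frame-level target-activity probabilities into
    (start, end) timestamps in seconds."""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the currently open segment
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and start is None:
            start = i  # a new active region begins
        elif not active and start is not None:
            segments.append((start * frame_shift, i * frame_shift))
            start = None
    if start is not None:
        # Close a segment that runs to the end of the signal.
        segments.append((start * frame_shift, len(probs) * frame_shift))
    return segments


def gate_with_segments(
    extracted: Sequence[float],
    segments: Sequence[Tuple[float, float]],
    sample_rate: int,
) -> List[float]:
    """Keep the extracted samples inside the detected segments and
    output silence wherever the target speaker is judged absent."""
    gated = [0.0] * len(extracted)
    for start_s, end_s in segments:
        lo = int(round(start_s * sample_rate))
        hi = min(len(extracted), int(round(end_s * sample_rate)))
        gated[lo:hi] = extracted[lo:hi]
    return gated
```

In this sketch the detection stage runs first and its timestamps gate the extraction output, so target-absent regions become silence. A parallel arrangement would instead run both models on the mixture independently and combine their outputs afterwards.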