Determining 'who spoke what and when' remains challenging in real-world applications. In typical scenarios, Speaker Diarization (SD) is employed to address 'who spoke when,' while Target Speaker Extraction (TSE) or Target Speaker Automatic Speech Recognition (TSASR) techniques are used to resolve 'who spoke what.' Although some works have achieved promising results by combining SD and TSE systems, two mismatches remain between SD and TSE: their outputs are inconsistent with each other, and they are optimized for different scenarios. To address these limitations, we propose a Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection (USEF-TP) model that jointly performs TSE and Personal Voice Activity Detection (PVAD). Instead of relying on speaker embeddings as in traditional approaches, USEF-TP uses frame-level features obtained through a cross-attention mechanism as speaker-related features. In addition, a multi-task learning algorithm with a scenario-aware differentiated loss function is applied to ensure robust performance across varying degrees of speaker overlap. Experimental results show that the proposed USEF-TP model achieves superior performance on the TSE and PVAD tasks on the LibriMix and SparseLibriMix datasets.
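The embedding-free conditioning described above can be illustrated with a minimal sketch: each mixture frame attends over the enrollment utterance's frames via cross-attention, yielding per-frame speaker-related features rather than a single fixed-dimensional speaker embedding. This is an assumption-laden toy example (module name, dimensions, and the use of `nn.MultiheadAttention` are illustrative), not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionSpeakerFeatures(nn.Module):
    """Illustrative sketch: frame-level speaker-related features via
    cross-attention, in place of a pooled speaker embedding."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mix_frames: torch.Tensor,
                enroll_frames: torch.Tensor) -> torch.Tensor:
        # mix_frames: (B, T_mix, D) encoded mixture frames (queries).
        # enroll_frames: (B, T_enroll, D) encoded enrollment frames
        # (keys/values). Each mixture frame attends over the enrollment
        # utterance, producing a speaker-conditioned feature per frame.
        out, _ = self.attn(mix_frames, enroll_frames, enroll_frames)
        return out  # (B, T_mix, D)

# Toy usage with random features in place of real encoder outputs.
B, T_mix, T_enr, D = 2, 100, 50, 64
feats = CrossAttentionSpeakerFeatures(D)(torch.randn(B, T_mix, D),
                                         torch.randn(B, T_enr, D))
print(feats.shape)  # torch.Size([2, 100, 64])
```

Because the speaker cue stays frame-level, the same features can drive both the extraction branch and the PVAD branch without collapsing the enrollment into one vector.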
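The abstract's "scenario-aware differentiated loss" can be sketched, under stated assumptions, as a weighted combination of a TSE reconstruction loss (here negative SI-SDR) and a PVAD classification loss (binary cross-entropy), with the weight depending on the mixture's overlap level. The specific weighting schedule and loss choices below are hypothetical illustrations, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def si_sdr_loss(est: torch.Tensor, ref: torch.Tensor,
                eps: float = 1e-8) -> torch.Tensor:
    # Negative scale-invariant SDR, a common TSE training objective.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref \
        / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    ratio = proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps)
    return -10 * torch.log10(ratio + eps).mean()

def multitask_loss(est_wav, ref_wav, vad_logits, vad_labels,
                   overlap_ratio: float) -> torch.Tensor:
    # Hypothetical scenario-aware weighting: weight the extraction loss
    # more heavily on highly overlapped mixtures. alpha's form is an
    # illustrative assumption.
    alpha = 0.5 + 0.5 * overlap_ratio
    tse = si_sdr_loss(est_wav, ref_wav)
    pvad = F.binary_cross_entropy_with_logits(vad_logits, vad_labels)
    return alpha * tse + pvad

# Toy usage with random signals and labels.
est, ref = torch.randn(2, 16000), torch.randn(2, 16000)
logits = torch.randn(2, 100)
labels = (torch.rand(2, 100) > 0.5).float()
loss = multitask_loss(est, ref, logits, labels, overlap_ratio=1.0)
```

Joint training with both terms is what lets a single model serve the TSE and PVAD tasks consistently, rather than cascading separate SD and TSE systems.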