Determining 'who spoke what and when' remains challenging in real-world applications. In typical scenarios, Speaker Diarization (SD) is employed to address 'who spoke when,' while Target Speaker Extraction (TSE) or Target Speaker Automatic Speech Recognition (TSASR) techniques are used to resolve 'who spoke what.' Although some works have achieved promising results by combining SD and TSE systems, two mismatches remain between SD and TSE: their outputs are inconsistent with each other, and they are optimized for different scenarios. To address these limitations, we propose a Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection (USEF-TP) model that jointly performs TSE and Personal Voice Activity Detection (PVAD). Instead of relying on speaker embeddings as in traditional approaches, USEF-TP uses frame-level features obtained through a cross-attention mechanism as speaker-related features. In addition, a multi-task learning algorithm with a scenario-aware differentiated loss function is applied to ensure robust performance across varying degrees of speaker overlap. Experimental results show that the proposed USEF-TP model achieves superior performance on the TSE and PVAD tasks on the LibriMix and SparseLibriMix datasets.
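The embedding-free conditioning described above can be illustrated with a minimal sketch: each mixture frame attends over the enrollment utterance's frames via cross-attention, yielding per-frame speaker-related features rather than a single fixed-dimensional speaker embedding. This is an assumption-laden toy example (module name, dimensions, and the use of `nn.MultiheadAttention` are illustrative), not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionSpeakerFeatures(nn.Module):
    """Illustrative sketch: frame-level speaker-related features via
    cross-attention, in place of a pooled speaker embedding."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mix_frames: torch.Tensor,
                enroll_frames: torch.Tensor) -> torch.Tensor:
        # mix_frames: (B, T_mix, D) encoded mixture frames (queries).
        # enroll_frames: (B, T_enroll, D) encoded enrollment frames
        # (keys/values). Each mixture frame attends over the enrollment
        # utterance, producing a speaker-conditioned feature per frame.
        out, _ = self.attn(mix_frames, enroll_frames, enroll_frames)
        return out  # (B, T_mix, D)

# Toy usage with random features in place of real encoder outputs.
B, T_mix, T_enr, D = 2, 100, 50, 64
feats = CrossAttentionSpeakerFeatures(D)(torch.randn(B, T_mix, D),
                                         torch.randn(B, T_enr, D))
print(feats.shape)  # torch.Size([2, 100, 64])
```

Because the speaker cue stays frame-level, the same features can drive both the extraction branch and the PVAD branch without collapsing the enrollment into one vector.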
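The abstract's "scenario-aware differentiated loss" can be sketched, under stated assumptions, as a weighted combination of a TSE reconstruction loss (here negative SI-SDR) and a PVAD classification loss (binary cross-entropy), with the weight depending on the mixture's overlap level. The specific weighting schedule and loss choices below are hypothetical illustrations, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def si_sdr_loss(est: torch.Tensor, ref: torch.Tensor,
                eps: float = 1e-8) -> torch.Tensor:
    # Negative scale-invariant SDR, a common TSE training objective.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref \
        / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    ratio = proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps)
    return -10 * torch.log10(ratio + eps).mean()

def multitask_loss(est_wav, ref_wav, vad_logits, vad_labels,
                   overlap_ratio: float) -> torch.Tensor:
    # Hypothetical scenario-aware weighting: weight the extraction loss
    # more heavily on highly overlapped mixtures. alpha's form is an
    # illustrative assumption.
    alpha = 0.5 + 0.5 * overlap_ratio
    tse = si_sdr_loss(est_wav, ref_wav)
    pvad = F.binary_cross_entropy_with_logits(vad_logits, vad_labels)
    return alpha * tse + pvad

# Toy usage with random signals and labels.
est, ref = torch.randn(2, 16000), torch.randn(2, 16000)
logits = torch.randn(2, 100)
labels = (torch.rand(2, 100) > 0.5).float()
loss = multitask_loss(est, ref, logits, labels, overlap_ratio=1.0)
```

Joint training with both terms is what lets a single model serve the TSE and PVAD tasks consistently, rather than cascading separate SD and TSE systems.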