Target Speech Extraction (TSE) traditionally relies on explicit clues about the speaker's identity, such as enrollment audio, face images, or videos, which may not always be available. In this paper, we propose StyleTSE, a text-guided TSE model that uses natural language descriptions of speaking style, in addition to the audio clue, to extract the desired speech from a given mixture. Our model integrates a speech separation network adapted from SepFormer with a bi-modality clue network that flexibly processes both audio and text clues. To train and evaluate our model, we introduce TextrolMix, a new dataset of speech mixtures paired with natural language descriptions. Experimental results demonstrate that our method effectively separates speech based not only on who is speaking, but also on how they are speaking, enhancing TSE in scenarios where traditional audio clues are absent. Demos are at: https://mingyue66.github.io/TextrolMix/demo/
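The abstract does not specify how the clue embedding conditions the separator, so the following is only a minimal PyTorch sketch of the bi-modality idea: a single clue network accepts either an audio enrollment clip or a text style description and emits one shared embedding, which then modulates the mixture representation. Everything here is an assumption for illustration (FiLM-style fusion, mean-pooled audio features, a GRU text encoder, and the names `BiModalityClueNetwork` and `ConditionedSeparator` are all hypothetical, and the toy mask network stands in for the SepFormer-adapted separator).

```python
import torch
import torch.nn as nn

class BiModalityClueNetwork(nn.Module):
    """Hypothetical sketch: encode EITHER an audio enrollment clue OR a
    text style description into one shared clue embedding."""
    def __init__(self, audio_dim=80, text_vocab=10000, embed_dim=256):
        super().__init__()
        # Audio branch: project log-mel frames, then mean-pool over time.
        self.audio_enc = nn.Sequential(
            nn.Linear(audio_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        # Text branch: embed style-description tokens, encode with a GRU.
        self.text_emb = nn.Embedding(text_vocab, embed_dim)
        self.text_enc = nn.GRU(embed_dim, embed_dim, batch_first=True)

    def forward(self, audio_clue=None, text_clue=None):
        assert (audio_clue is not None) or (text_clue is not None)
        if audio_clue is not None:
            # (B, T, audio_dim) -> (B, embed_dim)
            return self.audio_enc(audio_clue).mean(dim=1)
        # (B, L) token ids -> last GRU hidden state -> (B, embed_dim)
        _, h = self.text_enc(self.text_emb(text_clue))
        return h[-1]

class ConditionedSeparator(nn.Module):
    """Toy stand-in for the SepFormer-style separator: the clue embedding
    modulates mixture features via FiLM-style scale/shift, and a small
    network predicts a mask for the target speech (assumed mechanism)."""
    def __init__(self, feat_dim=256, embed_dim=256):
        super().__init__()
        self.film = nn.Linear(embed_dim, 2 * feat_dim)
        self.mask_net = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.Sigmoid(),
        )

    def forward(self, mix_feats, clue):
        # mix_feats: (B, T, feat_dim); clue: (B, embed_dim)
        scale, shift = self.film(clue).chunk(2, dim=-1)
        conditioned = mix_feats * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return mix_feats * self.mask_net(conditioned)  # masked target features

# Usage: the same pipeline runs with a text clue or an audio clue.
clue_net = BiModalityClueNetwork()
separator = ConditionedSeparator()
mix_feats = torch.randn(2, 100, 256)                 # dummy mixture features
text_clue = torch.randint(0, 10000, (2, 12))         # dummy style tokens
target = separator(mix_feats, clue_net(text_clue=text_clue))
```

The single shared embedding is what lets the model fall back on text descriptions of speaking style when no enrollment audio exists, since the separator itself is agnostic to which modality produced the clue.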