Target Speech Extraction (TSE) traditionally relies on explicit clues about the speaker's identity, such as enrollment audio, face images, or videos, which may not always be available. In this paper, we propose StyleTSE, a text-guided TSE model that uses natural language descriptions of speaking style, in addition to the audio clue, to extract the desired speech from a given mixture. Our model integrates a speech separation network adapted from SepFormer with a bi-modality clue network that flexibly processes both audio and text clues. To train and evaluate our model, we introduce TextrolMix, a new dataset of speech mixtures paired with natural language descriptions. Experimental results demonstrate that our method effectively separates speech based not only on who is speaking, but also on how they are speaking, enhancing TSE in scenarios where traditional audio clues are absent. Demos are at: https://mingyue66.github.io/TextrolMix/demo/
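The abstract does not specify how the clue embedding conditions the separator, so the following is only a minimal PyTorch sketch of the bi-modality idea: a single clue network accepts either an audio enrollment clip or a text style description and emits one shared embedding, which then modulates the mixture representation. Everything here is an assumption for illustration (FiLM-style fusion, mean-pooled audio features, a GRU text encoder, and the names `BiModalityClueNetwork` and `ConditionedSeparator` are all hypothetical, and the toy mask network stands in for the SepFormer-adapted separator).

```python
import torch
import torch.nn as nn

class BiModalityClueNetwork(nn.Module):
    """Hypothetical sketch: encode EITHER an audio enrollment clue OR a
    text style description into one shared clue embedding."""
    def __init__(self, audio_dim=80, text_vocab=10000, embed_dim=256):
        super().__init__()
        # Audio branch: project log-mel frames, then mean-pool over time.
        self.audio_enc = nn.Sequential(
            nn.Linear(audio_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        # Text branch: embed style-description tokens, encode with a GRU.
        self.text_emb = nn.Embedding(text_vocab, embed_dim)
        self.text_enc = nn.GRU(embed_dim, embed_dim, batch_first=True)

    def forward(self, audio_clue=None, text_clue=None):
        assert (audio_clue is not None) or (text_clue is not None)
        if audio_clue is not None:
            # (B, T, audio_dim) -> (B, embed_dim)
            return self.audio_enc(audio_clue).mean(dim=1)
        # (B, L) token ids -> last GRU hidden state -> (B, embed_dim)
        _, h = self.text_enc(self.text_emb(text_clue))
        return h[-1]

class ConditionedSeparator(nn.Module):
    """Toy stand-in for the SepFormer-style separator: the clue embedding
    modulates mixture features via FiLM-style scale/shift, and a small
    network predicts a mask for the target speech (assumed mechanism)."""
    def __init__(self, feat_dim=256, embed_dim=256):
        super().__init__()
        self.film = nn.Linear(embed_dim, 2 * feat_dim)
        self.mask_net = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.Sigmoid(),
        )

    def forward(self, mix_feats, clue):
        # mix_feats: (B, T, feat_dim); clue: (B, embed_dim)
        scale, shift = self.film(clue).chunk(2, dim=-1)
        conditioned = mix_feats * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return mix_feats * self.mask_net(conditioned)  # masked target features

# Usage: the same pipeline runs with a text clue or an audio clue.
clue_net = BiModalityClueNetwork()
separator = ConditionedSeparator()
mix_feats = torch.randn(2, 100, 256)                 # dummy mixture features
text_clue = torch.randint(0, 10000, (2, 12))         # dummy style tokens
target = separator(mix_feats, clue_net(text_clue=text_clue))
```

The single shared embedding is what lets the model fall back on text descriptions of speaking style when no enrollment audio exists, since the separator itself is agnostic to which modality produced the clue.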