We introduce a monaural neural speaker embedding extractor that computes an embedding for each speaker present in a speech mixture. To allow for supervised training, a teacher-student approach is employed: the teacher computes the target embeddings from each speaker's utterance before the utterances are summed to form the mixture, and the student embedding extractor is trained to reproduce those embeddings from the mixture at its input. The system verifies the presence or absence of a given speaker in a mixture much more reliably than a conventional speaker embedding extractor, and even performs comparably to a multi-channel approach that exploits spatial information for embedding extraction. Further, we show that a speaker embedding computed from one mixture can be used to check for the presence of that speaker in another mixture.
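The teacher-student objective and the presence check can be illustrated with a minimal sketch. It works directly on embedding vectors and leaves out the neural networks themselves. Several details here are assumptions that the abstract does not state: the cosine-distance loss, the permutation-invariant matching between student outputs and teacher targets, the decision threshold of 0.7, and all function names (`mix`, `student_loss`, `speaker_present`) are hypothetical choices made for illustration.

```python
import itertools
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def mix(utterances):
    """Form the mixture by adding the equal-length utterances sample by sample."""
    return [sum(samples) for samples in zip(*utterances)]


def student_loss(targets, estimates):
    """Mean cosine distance between teacher targets and student estimates.

    Assumption: the student's output order is arbitrary, so the loss is
    taken under the best one-to-one assignment (permutation-invariant).
    """
    n = len(targets)
    best = float("inf")
    for perm in itertools.permutations(range(n)):
        loss = sum(
            1.0 - cosine(targets[i], estimates[p]) for i, p in enumerate(perm)
        ) / n
        best = min(best, loss)
    return best


def speaker_present(enrolled, mixture_embeddings, threshold=0.7):
    """Accept the speaker if any mixture embedding is close to the enrolled one.

    The threshold is an illustrative value, not one taken from the paper.
    """
    return max(cosine(enrolled, e) for e in mixture_embeddings) >= threshold
```

In this sketch, the teacher would supply `targets` from the individual utterances, the student would supply `estimates` from the output of `mix`, and `speaker_present` would compare an enrollment embedding (from a single-speaker utterance or from another mixture) against the embeddings extracted from the test mixture.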