We present TokenSplit, a speech separation model that operates on discrete token sequences. The model is trained on multiple tasks simultaneously: separating and transcribing each speech source, and generating speech from text. It operates on transcripts and audio token sequences and performs the different tasks by masking its inputs. The model is a sequence-to-sequence encoder-decoder model based on the Transformer architecture. We also present a "refinement" version of the model that predicts enhanced audio tokens from the audio tokens of speech separated by a conventional separation model. Using both objective metrics and subjective MUSHRA listening tests, we show that our model achieves excellent separation performance, both with and without transcript conditioning. We also measure its automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the model's additional utility.
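To illustrate how masking inputs can select a task, the following is a minimal sketch and not the authors' implementation. The token strings, the `<mask>` and `<sep>` symbols, the field order, and the function name `build_encoder_input` are all assumptions made for illustration. Each input field (mixture audio tokens and the two transcripts) is either passed through or replaced by a single mask symbol before being fed to the encoder.

```python
from typing import List, Sequence

MASK = "<mask>"  # hypothetical symbol standing in for a hidden input field
SEP = "<sep>"    # hypothetical separator between input fields


def build_encoder_input(
    mix_audio: Sequence[str],
    transcript_1: Sequence[str],
    transcript_2: Sequence[str],
    mask_audio: bool = False,
    mask_transcripts: bool = False,
) -> List[str]:
    """Concatenate the input fields, hiding the ones that are masked.

    Assumed layout (illustrative only): audio tokens, then transcript 1,
    then transcript 2, separated by SEP.
    """

    def field(tokens: Sequence[str], masked: bool) -> List[str]:
        # A masked field collapses to a single MASK symbol.
        return [MASK] if masked else list(tokens)

    return (
        field(mix_audio, mask_audio)
        + [SEP]
        + field(transcript_1, mask_transcripts)
        + [SEP]
        + field(transcript_2, mask_transcripts)
    )


mix = ["a1", "a2", "a3"]
t1, t2 = ["hello"], ["world"]

# Transcripts hidden: the model must separate (and transcribe) from audio alone.
separation_input = build_encoder_input(mix, t1, t2, mask_transcripts=True)

# Audio hidden: the model must generate speech from text alone.
synthesis_input = build_encoder_input(mix, t1, t2, mask_audio=True)

# Nothing hidden: separation conditioned on the transcripts.
conditioned_input = build_encoder_input(mix, t1, t2)
```

Under these assumptions, a single encoder-decoder model sees the same input layout for every task, and only the choice of masked fields changes which task it is asked to perform.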