We study multi-task learning for two orthogonal speech technology tasks: speech recognition and speaker recognition. We use wav2vec2 as the base architecture and attach two task-specific output heads. We experiment with different methods for mixing speaker and speech information in the output embedding sequence, and propose a simple dynamic approach for balancing the speech and speaker recognition loss functions. Our multi-task learning networks produce a shared speech and speaker embedding; evaluated on the LibriSpeech and VoxCeleb test sets, they achieve performance comparable to separate single-task models. Code is available at https://github.com/nikvaessen/2022-repo-mt-w2v2.
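To illustrate what dynamically balancing two loss functions can look like, the sketch below weights each loss by the relative magnitude of the other, so that both weighted terms contribute equally at every step. This is only one plausible scheme under stated assumptions; the abstract does not specify the balancing rule, and the names `balance_weights` and `combined_loss` are hypothetical, not taken from the released code.

```python
def balance_weights(speech_loss: float, speaker_loss: float) -> tuple[float, float]:
    """Return (w_speech, w_speaker) such that the weights sum to 1 and
    w_speech * speech_loss == w_speaker * speaker_loss.

    ASSUMPTION: this equal-contribution rule is illustrative only and is
    not claimed to be the method proposed in the paper.
    """
    total = speech_loss + speaker_loss
    if total <= 0.0:
        # Degenerate case (both losses are zero): fall back to equal weights.
        return 0.5, 0.5
    # Each task is weighted by the other task's share of the total loss,
    # so the larger loss receives the smaller weight.
    return speaker_loss / total, speech_loss / total


def combined_loss(speech_loss: float, speaker_loss: float) -> float:
    """Weighted sum of the two task losses using the dynamic weights."""
    w_speech, w_speaker = balance_weights(speech_loss, speaker_loss)
    return w_speech * speech_loss + w_speaker * speaker_loss
```

In an actual training loop the weights would typically be computed from loss values that are excluded from the gradient computation, so that the weights act as constants during backpropagation; that detail is omitted here because the sketch uses plain floats.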