We study multi-task learning for two orthogonal speech technology tasks: speech recognition and speaker recognition. We use wav2vec2 as the base architecture and attach two task-specific output heads. We experiment with several architectural choices for mixing speaker and speech information in the output sequence, as well as with different optimization strategies. Our multi-task networks can produce a shared speaker and speech embedding which, at first glance, achieves performance comparable to that of separate single-task models. However, we show that the multi-task networks perform substantially worse than the single-task models on out-of-distribution evaluation data. Code and model checkpoints are available at https://github.com/nikvaessen/disjoint-mtl
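The two-head setup described above can be illustrated with a minimal PyTorch sketch. It is not the implementation from the repository: the class name `TwoHeadSketch`, the small stand-in encoder used in place of wav2vec2, mean pooling for the speaker embedding, all layer sizes, and the static loss weighting in `multitask_loss` are assumptions chosen only to keep the example self-contained.

```python
import torch
from torch import nn


class TwoHeadSketch(nn.Module):
    """Shared encoder with a per-frame speech head and a pooled speaker head."""

    def __init__(self, feat_dim=16, hidden_dim=32, vocab_size=30, num_speakers=10):
        super().__init__()
        # Stand-in for the wav2vec2 encoder (assumption): any module mapping
        # (batch, frames, feat_dim) -> (batch, frames, hidden_dim) fits here.
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.GELU())
        # Speech head: one vocabulary distribution per output frame.
        self.speech_head = nn.Linear(hidden_dim, vocab_size)
        # Speaker head: classifies the utterance-level embedding.
        self.speaker_head = nn.Linear(hidden_dim, num_speakers)

    def forward(self, features):
        hidden = self.encoder(features)                        # (B, T, H)
        speech_logits = self.speech_head(hidden)               # (B, T, V)
        # Assumption: mean pooling over time yields the speaker embedding.
        speaker_embedding = hidden.mean(dim=1)                 # (B, H)
        speaker_logits = self.speaker_head(speaker_embedding)  # (B, S)
        return speech_logits, speaker_embedding, speaker_logits


def multitask_loss(speech_loss, speaker_loss, speech_weight=0.5):
    """Hypothetical static weighting of the two task losses."""
    return speech_weight * speech_loss + (1.0 - speech_weight) * speaker_loss


if __name__ == "__main__":
    model = TwoHeadSketch()
    batch = torch.randn(2, 50, 16)  # 2 utterances, 50 frames, 16 features
    speech_logits, speaker_embedding, speaker_logits = model(batch)
    print(speech_logits.shape, speaker_embedding.shape, speaker_logits.shape)
```

Both heads read the same encoder output sequence, so gradients from the two tasks meet in the shared encoder. That shared sequence is where the architectural and optimization choices mentioned above would take effect.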