Recent advancements in language models have significantly enhanced performance on multiple speech-related tasks. Existing speech language models typically use task-dependent prompt tokens to unify various speech tasks in a single model. However, this design overlooks the intrinsic connections between different speech tasks, which could boost the performance of each task. In this work, we propose a novel decoder-only speech language model, SpeechComposer, that unifies common speech tasks by composing a fixed set of prompt tokens. Built upon four primary tasks (speech synthesis, speech recognition, speech language modeling, and text language modeling), SpeechComposer can easily extend to more speech tasks, such as voice conversion and speech enhancement, via compositions of well-designed prompt tokens. The unification of prompt tokens also enables knowledge sharing among different speech tasks in a more structured manner. Experimental results demonstrate that SpeechComposer improves the performance of both primary tasks and composite tasks, showing the effectiveness of the shared prompt tokens. Remarkably, the unified decoder-only model achieves performance comparable to, or even better than, that of the baselines, which are expert models designed for single tasks.
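To make the idea of composing a fixed set of prompt tokens concrete, the following is a minimal sketch rather than the model's actual input format. The token names, the `TASKS` table, and the `build_prompt` helper are illustrative assumptions and may differ from the prompt design used by SpeechComposer; in particular, the layouts shown for voice conversion and speech enhancement are only one plausible way to express them with the same four tokens.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical fixed prompt vocabulary (assumption, for illustration only):
# one "start" token per input modality, one "generate" token per output modality.
START: Dict[str, str] = {"text": "<start-text>", "speech": "<start-speech>"}
GENERATE: Dict[str, str] = {"text": "<generate-text>", "speech": "<generate-speech>"}

# Each task is described by the modalities of its conditioning segments
# and the modality of the target. No task gets a token of its own.
TASKS: Dict[str, Tuple[Tuple[str, ...], str]] = {
    # Primary tasks named in the abstract.
    "text_lm": ((), "text"),
    "speech_lm": ((), "speech"),
    "asr": (("speech",), "text"),
    "tts": (("text",), "speech"),
    # Composite tasks, expressed with the same tokens (assumed layouts).
    "speech_enhancement": (("speech",), "speech"),
    "voice_conversion": (("speech", "speech"), "speech"),
}


def build_prompt(task: str, segments: Sequence[List[str]]) -> List[str]:
    """Return the token sequence that conditions a decoder-only model on `task`.

    `segments` holds one already-tokenized sequence per conditioning input,
    in the order given by the TASKS table.
    """
    conditions, target = TASKS[task]
    if len(segments) != len(conditions):
        raise ValueError(
            f"{task} expects {len(conditions)} segment(s), got {len(segments)}"
        )
    tokens: List[str] = []
    for modality, segment in zip(conditions, segments):
        tokens.append(START[modality])  # mark the modality of this input
        tokens.extend(segment)
    tokens.append(GENERATE[target])  # the model continues from here
    return tokens
```

In this sketch, adding a task means adding one row to the table and leaves the vocabulary unchanged, so every task reuses prompt tokens that other tasks also train on.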