Although text-based large language models exhibit human-level writing ability and remarkable intelligence, speech language models (SLMs) still struggle to generate semantically coherent output. Several factors may explain this performance degradation: (A) speech tokens mainly carry phonetic rather than semantic information, (B) speech sequences are much longer than text sequences, and (C) paralinguistic information, such as prosody, introduces additional complexity and variability. In this paper, we disentangle the influence of these three factors by transitioning the modality from text to speech in a step-by-step manner. Our findings reveal that the three factors differ in their impact: factor A has a relatively minor effect, factor B more noticeably affects syntactic and semantic modeling, and factor C exerts the most significant impact, particularly on basic lexical modeling. Based on these findings, we provide insights into the unique challenges of training SLMs and highlight pathways toward more effective end-to-end SLMs.