Fusing speech into a pre-trained language model to form a speech language model (SpeechLM) usually suffers from inefficient encoding of long-form speech and catastrophic forgetting of the pre-trained text modality. We propose SSR-Connector (Segmented Speech Representation Connector) for better modality fusion. Leveraging speech-text alignments, our approach segments and compresses speech features to match the granularity of text embeddings. Additionally, we introduce a two-stage training pipeline, consisting of a distillation phase and a fine-tuning phase, to mitigate catastrophic forgetting. SSR-Connector outperforms existing mechanisms for speech-text modality fusion, consistently achieving better speech understanding (e.g., +10 accuracy on StoryCloze and +20 on Speech-MMLU) while preserving the pre-trained text ability.
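The core idea of alignment-based segmentation and compression can be sketched as follows. This is a minimal illustration, not the paper's exact connector: it assumes frame-level speech features and per-token `(start, end)` frame boundaries from a hypothetical speech-text aligner, and uses simple mean-pooling as the compression step.

```python
import numpy as np

def segment_and_compress(speech_feats, boundaries):
    """Mean-pool speech frames within each aligned segment.

    speech_feats: [T, D] array of frame-level speech features.
    boundaries: list of (start, end) frame spans, one per text token,
                assumed to come from a speech-text aligner.
    Returns a [num_tokens, D] array matching text-embedding granularity.
    """
    return np.stack([speech_feats[s:e].mean(axis=0) for s, e in boundaries])

# Example: 10 frames of 4-dim features aligned to 3 text tokens
feats = np.arange(40, dtype=float).reshape(10, 4)
segs = segment_and_compress(feats, [(0, 3), (3, 7), (7, 10)])
print(segs.shape)  # (3, 4): one compressed vector per aligned token
```

After compression, each speech segment occupies a single position in the language model's input sequence, which is what lets long-form speech match the granularity (and sequence length) of the corresponding text.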