While textless Spoken Language Models (SLMs) have shown potential in end-to-end speech-to-speech modeling, they still lag behind text-based Large Language Models (LLMs) in semantic coherence and relevance. This work introduces the Align-SLM framework, which leverages preference optimization inspired by Reinforcement Learning from AI Feedback (RLAIF) to enhance the semantic understanding of SLMs. Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO). We evaluate the framework using the ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT-4o score and human evaluation. Experimental results show that our method achieves state-of-the-art performance for SLMs on most benchmarks, highlighting the importance of preference optimization for improving the semantics of SLMs.
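To make the described pipeline concrete, the sketch below illustrates the two steps the abstract names: building (chosen, rejected) pairs by ranking sampled continuations with an automatic semantic metric, and the standard DPO objective trained on those pairs. The `generate` and `semantic_score` callables are hypothetical placeholders, since the abstract does not specify the sampling interface or the exact metric; the loss itself follows the published DPO formulation (Rafailov et al., 2023), not necessarily the paper's exact training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: increase the policy's margin for the
    preferred continuation over the rejected one, measured relative
    to a frozen reference model."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

def build_preference_pair(generate, semantic_score, prompt, num_samples=4):
    """Sample several speech continuations for one prompt and keep the
    best/worst under a semantic metric as a (chosen, rejected) pair.
    NOTE: `generate` and `semantic_score` are hypothetical stand-ins,
    not interfaces defined by the paper."""
    continuations = [generate(prompt) for _ in range(num_samples)]
    ranked = sorted(continuations, key=semantic_score)
    return ranked[-1], ranked[0]  # (chosen, rejected)
```

One appeal of this route is that ranking sampled continuations replaces an explicitly trained reward model: training only needs log-probabilities of each pair under the policy and a frozen reference.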