Spoken Language Models (SLMs) aim to learn linguistic competence directly from speech using discrete units, widening access to Natural Language Processing (NLP) technologies for languages with limited written resources. However, progress has been largely English-centric due to scarce spoken evaluation benchmarks and training data, making cross-lingual learning difficult. We present a cross-lingual interleaving method that mixes speech tokens across languages without textual supervision. We also release an EN-FR training dataset, TinyStories (~42k hours), together with EN-FR spoken StoryCloze and TopicCloze benchmarks for cross-lingual semantic evaluation, both synthetically generated using GPT-4. On 360M and 1B SLMs under matched training-token budgets, interleaving improves monolingual semantic accuracy, enables robust cross-lingual continuation, and strengthens cross-lingual hidden-state alignment. Taken together, these results indicate that cross-lingual interleaving is a simple, scalable route to building multilingual SLMs that understand and converse across languages. All resources will be made open-source to support reproducibility.
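As a rough illustration of what mixing speech tokens across languages could look like, the sketch below interleaves discrete speech units from parallel English and French stories at sentence boundaries. The abstract does not specify the switching granularity or schedule, so the sentence-level alignment, the `switch_prob` parameter, and the function name `interleave_units` are all assumptions made for illustration, not the authors' implementation.

```python
import random
from typing import List, Sequence, Tuple


def interleave_units(
    en_sents: Sequence[Sequence[int]],
    fr_sents: Sequence[Sequence[int]],
    switch_prob: float = 0.5,
    seed: int = 0,
) -> Tuple[List[int], List[str]]:
    """Build one training sequence by switching language at sentence boundaries.

    Assumption (not from the abstract): en_sents[i] and fr_sents[i] hold the
    discrete speech units of the same sentence i spoken in each language.
    No text is used; only unit sequences are mixed.
    """
    if len(en_sents) != len(fr_sents):
        raise ValueError("expected sentence-aligned EN and FR unit sequences")

    rng = random.Random(seed)
    lang = rng.choice(("en", "fr"))  # language of the first sentence
    mixed: List[int] = []
    langs: List[str] = []

    for en_units, fr_units in zip(en_sents, fr_sents):
        # Hypothetical schedule: flip language with probability switch_prob.
        if rng.random() < switch_prob:
            lang = "fr" if lang == "en" else "en"
        mixed.extend(en_units if lang == "en" else fr_units)
        langs.append(lang)

    return mixed, langs


# Toy example: unit IDs below 100 stand in for EN, 100 and above for FR.
en_story = [[1, 2, 3], [4, 5], [6, 7, 8]]
fr_story = [[101, 102], [103, 104, 105], [106]]
mixed_units, sentence_langs = interleave_units(en_story, fr_story, switch_prob=1.0)
```

With `switch_prob=1.0` the language alternates at every sentence, and with `switch_prob=0.0` the sequence stays monolingual, which gives a simple knob for comparing interleaved and non-interleaved training under the same token budget.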