Pure speech language models aim to learn language directly from raw audio without textual resources. A key challenge is that discrete tokens from self-supervised speech encoders result in excessively long sequences, motivating recent work on syllable-like units. However, methods like Sylber and SyllableLM rely on intricate multi-stage training pipelines. We propose ZeroSyl, a simple training-free method to extract syllable boundaries and embeddings directly from a frozen WavLM model. Using L2 norms of features in WavLM's intermediate layers, ZeroSyl achieves competitive syllable segmentation performance. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model. ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show that while finer-grained units are beneficial for lexical tasks, our discovered syllabic units exhibit better scaling behavior for syntactic modeling.
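The pipeline described above (per-frame L2 norms of intermediate-layer features, boundary detection, mean-pooling, K-means discretization) can be sketched as follows. This is a minimal illustration with synthetic frame features standing in for WavLM activations; the rule of placing boundaries at local minima of the norm curve, the `min_gap` parameter, and the toy K-means are our assumptions, not the paper's exact algorithm.

```python
import numpy as np

def syllable_boundaries(feats, min_gap=2):
    """Assumption: boundaries fall at local minima of the per-frame L2 norm.
    feats: (T, D) array of frame features (a stand-in for WavLM layer outputs)."""
    norms = np.linalg.norm(feats, axis=1)
    bounds = [0]
    for t in range(1, len(norms) - 1):
        if norms[t] < norms[t - 1] and norms[t] <= norms[t + 1] \
                and t - bounds[-1] >= min_gap:
            bounds.append(t)
    bounds.append(len(norms))
    return bounds

def mean_pool(feats, bounds):
    """One embedding per discovered segment, via mean-pooling its frames."""
    return np.stack([feats[a:b].mean(axis=0)
                     for a, b in zip(bounds[:-1], bounds[1:])])

def kmeans_assign(X, k, iters=20, seed=0):
    """Minimal K-means: map pooled segment embeddings to discrete unit IDs."""
    rng = np.random.default_rng(seed)
    cent = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        ids = np.argmin(((X[:, None] - cent[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (ids == c).any():
                cent[c] = X[ids == c].mean(axis=0)
    return ids

# Synthetic "audio": 30 frames of 8-dim features, with three artificially
# low-norm frames marking pseudo-syllable boundaries.
rng = np.random.default_rng(1)
feats = rng.normal(1.0, 0.1, size=(30, 8))
feats[[7, 15, 23]] *= 0.1

bounds = syllable_boundaries(feats)
units = kmeans_assign(mean_pool(feats, bounds), k=3)
print(bounds, units)  # segment boundaries and their discrete unit IDs
```

In the actual method these unit IDs would form the token sequence used to train the downstream language model; here they only illustrate the frames-to-units compression that makes syllabic tokenization attractive.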