Pure speech language models aim to learn language directly from raw audio without textual resources. A key challenge is that discrete tokens from self-supervised speech encoders result in excessively long sequences, motivating recent work on syllable-like units. However, methods like Sylber and SyllableLM rely on intricate multi-stage training pipelines. We propose ZeroSyl, a simple training-free method to extract syllable boundaries and embeddings directly from a frozen WavLM model. Using L2 norms of features in WavLM's intermediate layers, ZeroSyl achieves competitive syllable segmentation performance. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model. ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show that while finer-grained units are beneficial for lexical tasks, our discovered syllabic units exhibit better scaling behavior for syntactic modeling.
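The pipeline described above (frame-wise L2 norms, boundary detection, mean pooling, K-means) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not say which WavLM layer is used or how boundaries are derived from the norms, so the local-minimum rule, the smoothing window, and all function names here are hypothetical. The input is assumed to be a `(T, D)` array of frame features already extracted from a frozen encoder.

```python
import numpy as np


def frame_norms(features):
    """L2 norm of each frame; `features` has shape (T, D)."""
    return np.linalg.norm(features, axis=1)


def find_boundaries(norms, window=3):
    """Assumed boundary rule (not from the paper): place a boundary at each
    local minimum of the moving-average-smoothed norm curve."""
    kernel = np.ones(window) / window
    smooth = np.convolve(norms, kernel, mode="same")
    inner = np.arange(1, len(smooth) - 1)
    is_min = (smooth[inner] < smooth[inner - 1]) & (smooth[inner] <= smooth[inner + 1])
    # Always include the start and end of the utterance.
    return [0] + inner[is_min].tolist() + [len(norms)]


def mean_pool(features, boundaries):
    """Average the frames inside each segment, giving one embedding per segment."""
    pairs = zip(boundaries[:-1], boundaries[1:])
    return np.stack([features[s:e].mean(axis=0) for s, e in pairs])


def assign(x, centroids):
    """Index of the nearest centroid for each row of `x`."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = assign(x, centroids)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, assign(x, centroids)


def tokenize(features, centroids, window=3):
    """Map one utterance's frame features to a sequence of discrete unit IDs."""
    boundaries = find_boundaries(frame_norms(features), window=window)
    return assign(mean_pool(features, boundaries), centroids)
```

In practice the centroids would be fitted once on pooled segments from a whole training corpus, and `tokenize` would then convert each utterance into the unit sequence on which the language model is trained.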