Current language models rely on subword tokenization algorithms such as Byte Pair Encoding, which calls into question their validity as models of linguistic representation. In this paper, we explore the potential of tokenization-free, phoneme- and grapheme-based language models. We demonstrate that small models based on the Llama architecture, trained with character-level vocabularies, achieve strong linguistic performance on standard syntactic benchmarks as well as on novel lexical and phonetic benchmarks. We further show that phoneme-based models without any graphemic bias nearly match grapheme-based models on both the standard tasks and the novel evaluations. Our findings point to a promising direction for creating more linguistically plausible language models that are better suited to computational studies of language acquisition and processing.