This paper presents a novel syllable-based tokenization approach for Indonesian large language models, inspired by the pedagogical methodology of the Gasing Literacy Learning System. Drawing on information-theoretic principles, we develop a tokenization framework that segments Indonesian text at syllable boundaries before applying byte-pair encoding, creating a vocabulary that aligns with the language's morphophonological structure. Our approach first identifies high-frequency syllables through rule-based segmentation, then constructs a compact vocabulary of 3,500 tokens that preserves meaningful linguistic units while maintaining coverage through character-level fallback. Empirical evaluation on Indonesian Wikipedia and folklore corpora from the Indonesian Culture Digital Library (PDBI) demonstrates substantial improvements over conventional tokenization methods: the syllable-based approach achieves a Rényi efficiency of 0.74, compared to 0.50-0.64 for pretrained multilingual tokenizers, while maintaining a higher average token length (3.67 characters versus 2.72 for GPT-2) despite using a vocabulary an order of magnitude smaller. These gains emerge from the method's ability to internalize character-level dependencies within syllable units, reducing the computational burden on language models while respecting Indonesian's agglutinative morphology. We call the LLM built on this principle TOBA LLM (Tokenisasi Optimum Berbasis Aglutinasi). The convergence of human literacy pedagogy with computational optimization offers a promising paradigm for developing linguistically informed tokenization strategies, particularly for morphologically rich languages that remain underrepresented in natural language processing.
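To make the two-stage pipeline concrete, the sketch below illustrates rule-based syllable segmentation followed by frequency-based vocabulary construction with character-level fallback. It is a minimal illustration under simplifying assumptions, not the paper's implementation: the syllable rule is a rough (C)V(C) onset-maximizing regex that ignores digraphs such as "ng"/"ny", the BPE merge stage is omitted, and the names `syllabify`, `build_vocab`, and `encode` are hypothetical.

```python
import re
from collections import Counter

# Rough Indonesian syllable shape (C)V(C): a greedy consonant onset, a vowel
# nucleus, and a coda consonant only when the word ends or the next letter is
# also a consonant (a single consonant before a vowel starts the next
# syllable). Digraphs such as "ng"/"ny" are not treated specially here.
_SYL = re.compile(r"[^aeiou]*[aeiou]+(?:[^aeiou]+$|[^aeiou](?=[^aeiou]))?")

def syllabify(word: str) -> list[str]:
    """Rule-based segmentation of one lowercase word into syllables."""
    return _SYL.findall(word) or [word]  # vowel-less input: keep as-is

def build_vocab(corpus: list[str], budget: int = 3500) -> set[str]:
    """Keep the highest-frequency syllables, reserving slots for single
    characters so every input stays encodable (character-level fallback)."""
    counts = Counter(
        syl
        for text in corpus
        for word in re.findall(r"[a-z]+", text.lower())
        for syl in syllabify(word)
    )
    chars = set("abcdefghijklmnopqrstuvwxyz")
    top = [s for s, _ in counts.most_common(budget - len(chars))]
    return chars | set(top)

def encode(word: str, vocab: set[str]) -> list[str]:
    """Emit a syllable token when it is in-vocabulary, else its characters."""
    tokens: list[str] = []
    for syl in syllabify(word.lower()):
        tokens.extend([syl] if syl in vocab else list(syl))
    return tokens

if __name__ == "__main__":
    vocab = build_vocab(["tokenisasi berbasis suku kata untuk bahasa indonesia"])
    print(syllabify("tokenisasi"))      # -> ['to', 'ke', 'ni', 'sa', 'si']
    print(encode("aglutinasi", vocab))  # syllable tokens or character fallback
```

Because each syllable token already bundles its internal character dependencies, the model no longer has to relearn them token by token, which is the mechanism behind the higher average token length reported above.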
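The Rényi efficiency figures quoted above can be read against the standard tokenizer-quality metric of Zouhar et al. (2023), which we assume here; whether the paper uses this exact normalization and choice of $\alpha$ is not stated in the abstract. For the unigram token distribution $p$ that a tokenizer induces on a corpus, with vocabulary $\mathcal{V}$:

$$
\operatorname{eff}_\alpha(p) \;=\; \frac{H_\alpha(p)}{\log |\mathcal{V}|},
\qquad
H_\alpha(p) \;=\; \frac{1}{1-\alpha}\,\log \sum_{w \in \mathcal{V}} p(w)^{\alpha}.
$$

As $\alpha \to 1$, $H_\alpha$ reduces to Shannon entropy, so the efficiency measures how evenly token mass is spread relative to the vocabulary's capacity. A compact 3,500-token vocabulary shrinks the $\log |\mathcal{V}|$ denominator, which is consistent with the higher efficiency reported despite a vocabulary an order of magnitude smaller than multilingual baselines.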