The quality of subword tokenization is critical for Large Language Models, yet evaluating tokenizers for morphologically rich Uralic languages is hampered by the lack of clean morpheme lexicons. We introduce SampoNLP, a corpus-free toolkit for morphological lexicon creation based on MDL-inspired Self-Referential Atomicity Scoring, which filters composite forms using internal structural cues, making it well suited to low-resource settings. Using the high-purity lexicons SampoNLP generates for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers across a range of vocabulary sizes (8k-256k). We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade-off between morpheme coverage and over-splitting. By analyzing the IPS curves, we identify the "elbow points" of diminishing returns and provide the first empirically grounded recommendations for optimal vocabulary sizes (k) in these languages. Our study not only offers practical guidance but also quantitatively demonstrates the limitations of standard BPE for highly agglutinative languages. The SampoNLP library and all generated resources are publicly available at https://github.com/AragonerUA/SampoNLP