Tokenization shapes how language models perceive morphology and meaning, yet widely used frequency-driven subword tokenizers (e.g., Byte Pair Encoding and WordPiece) can fragment morphologically rich, agglutinative languages in ways that obscure morpheme boundaries. We introduce a linguistically informed hybrid tokenizer for Turkish that combines (i) dictionary-driven morphological segmentation (roots and affixes), (ii) phonological normalization that maps allomorphic variants to shared identifiers, and (iii) a controlled subword fallback for out-of-vocabulary coverage. Concretely, our released Turkish vocabulary contains 22,231 root tokens mapped to 20,000 canonical root identifiers (with leading spaces marking word boundaries), 72 affix identifiers covering 177 allomorphic surface forms, and 12,696 subword units; an orthographic case token preserves capitalization without inflating the vocabulary. We evaluate tokenization quality on the TR-MMLU dataset using two linguistic alignment metrics: Turkish Token Percentage (TR~\%), the proportion of produced tokens that correspond to Turkish lexical/morphemic units under our lexical resources, and Pure Token Percentage (Pure~\%), the proportion of tokens aligning with unambiguous root/affix boundaries. The proposed tokenizer reaches 90.29\% TR~\% and 85.80\% Pure~\% on TR-MMLU, substantially exceeding several general-purpose tokenizers. We further validate practical utility on downstream sentence embedding benchmarks under a strict \emph{random-initialization} control that isolates the tokenizer's inductive bias. Across four matched models (TurkishTokenizer, CosmosGPT2, Mursit, and Tabi), TurkishTokenizer outperforms all baselines on the Turkish STS Benchmark and achieves the strongest overall average on MTEB-TR. It also yields the strongest average accuracy on TurBLiMP under a centroid-based proxy.
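The three-stage pipeline named above can be illustrated with a minimal, hypothetical sketch: longest-match root lookup against a dictionary, affix stripping that maps allomorphs (e.g., \emph{-ler}/\emph{-lar}) to one shared identifier, and a character-level subword fallback for leftover material. The dictionaries and identifier names here are toy stand-ins, not the released resources.

```python
# Toy stand-ins for the released lexical resources (illustrative only).
ROOTS = {"ev": "ROOT_ev", "kitap": "ROOT_kitap"}      # surface root -> canonical id
AFFIXES = {"ler": "AFF_PLURAL", "lar": "AFF_PLURAL",  # allomorphs share one id
           "de": "AFF_LOC", "da": "AFF_LOC"}

def tokenize_word(word):
    # (i) dictionary-driven root segmentation: longest matching root prefix
    for end in range(len(word), 0, -1):
        root = word[:end]
        if root in ROOTS:
            tokens = [ROOTS[root]]
            rest = word[end:]
            # (ii) phonological normalization: map each affix allomorph
            # to its shared identifier, again longest-match first
            while rest:
                for aend in range(len(rest), 0, -1):
                    if rest[:aend] in AFFIXES:
                        tokens.append(AFFIXES[rest[:aend]])
                        rest = rest[aend:]
                        break
                else:
                    break  # no affix matched; fall through to subwords
            # (iii) controlled subword fallback for any leftover material
            tokens.extend(f"SUB_{ch}" for ch in rest)
            return tokens
    return [f"SUB_{ch}" for ch in word]  # fully out-of-vocabulary word

print(tokenize_word("evlerde"))   # -> ['ROOT_ev', 'AFF_PLURAL', 'AFF_LOC']
print(tokenize_word("kitaplar"))  # -> ['ROOT_kitap', 'AFF_PLURAL']
```

Because both allomorphs of the plural and locative map to one identifier, \emph{evlerde} and a back-vowel counterpart would yield the same affix token sequence, which is the point of the normalization step.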
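The two alignment metrics are simple proportions over the produced token stream; a hedged sketch under assumed token-set memberships (the sets below are illustrative, not the paper's lexical resources):

```python
# Illustrative stand-ins: which token ids count as Turkish lexical/morphemic
# units (for TR %) and which align with unambiguous root/affix boundaries
# (for Pure %). The real metrics consult the released lexical resources.
TURKISH_UNITS = {"ROOT_ev", "AFF_PLURAL", "AFF_LOC"}
PURE_UNITS = {"ROOT_ev", "AFF_PLURAL"}

def tr_and_pure_percent(tokens):
    """Return (TR %, Pure %) for a produced token sequence."""
    tr = 100.0 * sum(t in TURKISH_UNITS for t in tokens) / len(tokens)
    pure = 100.0 * sum(t in PURE_UNITS for t in tokens) / len(tokens)
    return tr, pure

print(tr_and_pure_percent(["ROOT_ev", "AFF_PLURAL", "AFF_LOC", "SUB_x"]))
# -> (75.0, 50.0): 3 of 4 tokens are Turkish units, 2 of 4 are pure
```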