Optimal Turkish Subword Strategies at Scale: Systematic Evaluation of Data, Vocabulary, Morphology Interplay

Tokenization is a pivotal design choice for neural language modeling in morphologically rich languages (MRLs) such as Turkish, where productive agglutination challenges both vocabulary efficiency and morphological fidelity. Prior studies have explored tokenizer families and vocabulary sizes but typically (i) vary vocabulary size without systematically controlling the tokenizer's training corpus, (ii) provide limited intrinsic diagnostics, and (iii) evaluate on a narrow slice of downstream tasks. We present the first comprehensive, principled study of Turkish subword tokenization, a "subwords manifest" that jointly varies vocabulary size and tokenizer training corpus size (data-vocabulary coupling), compares multiple tokenizer families (WordPiece, morphology-level, and character-level baselines) under matched parameter budgets, and evaluates them on semantic tasks (NLI, STS, sentiment analysis, NER), syntactic tasks (POS tagging, dependency parsing), and morphology-sensitive probes. To explain why tokenizers succeed or fail, we introduce a morphology-aware diagnostic toolkit that goes beyond coarse aggregates to report boundary-level micro/macro F1, lemma atomicity decoupled from surface boundary hits, over- and under-segmentation indices, character and word edit distances (CER/WER), continuation rates, affix-type coverage, and token-level atomicity. Our contributions are fourfold: (i) a systematic investigation of the vocabulary-corpus-success triad; (ii) a unified, morphology-aware evaluation framework linking intrinsic diagnostics to extrinsic outcomes; (iii) controlled comparisons identifying when character-level and morphology-level tokenization pay off; and (iv) an open-source release of evaluation code, tokenizer pipelines, and models. As the first work of its kind, this "subwords manifest" delivers actionable guidance for building effective tokenizers for MRLs and establishes a reproducible foundation for future research.
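
To make the boundary-level micro/macro F1 diagnostic concrete, the following is a minimal sketch and not the released evaluation code. It assumes that a segmentation is given as a list of surface pieces per word, that boundaries are compared as character offsets inside the word, and that a word with no gold and no predicted boundaries scores 1.0 in the macro average. The function names `boundaries`, `prf`, and `boundary_f1` are hypothetical, and the segmentations of "evlerimde" and "kitap" are illustrative inputs, not data from the study.

```python
from typing import List, Set, Tuple


def boundaries(segments: List[str]) -> Set[int]:
    """Return the internal boundary offsets (in characters) of one segmented word."""
    offsets: Set[int] = set()
    pos = 0
    for seg in segments[:-1]:  # the end of the last piece is the word end, not a boundary
        pos += len(seg)
        offsets.add(pos)
    return offsets


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from counts, returning 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def boundary_f1(pred: List[List[str]], gold: List[List[str]]) -> Tuple[float, float]:
    """Return (micro_f1, macro_f1) over aligned predicted and gold segmentations."""
    total_tp = total_fp = total_fn = 0
    per_word_f1: List[float] = []
    for p, g in zip(pred, gold):
        pb, gb = boundaries(p), boundaries(g)
        tp, fp, fn = len(pb & gb), len(pb - gb), len(gb - pb)
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
        # Assumed convention: an unsegmented word predicted as unsegmented is a perfect match.
        per_word_f1.append(1.0 if not pb and not gb else prf(tp, fp, fn)[2])
    micro = prf(total_tp, total_fp, total_fn)[2]  # pooled counts over all words
    macro = sum(per_word_f1) / len(per_word_f1) if per_word_f1 else 0.0  # mean of per-word F1
    return micro, macro


# Illustrative input: "evlerimde" = ev+ler+im+de; "kitap" is a single morpheme.
gold = [["ev", "ler", "im", "de"], ["kitap"]]
pred = [["evler", "im", "de"], ["kit", "ap"]]
micro, macro = boundary_f1(pred, gold)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this sketch, a spurious predicted boundary (as in "kit|ap") counts as a false positive and a missed gold boundary (as in "ev|ler") as a false negative, so the same counts could also feed simple over- and under-segmentation tallies. Micro F1 pools counts across words, whereas macro F1 averages per-word scores and is therefore more sensitive to short words that are split incorrectly.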
