The Degree of Language Diacriticity and Its Effect on Tasks

Diacritics are orthographic marks that clarify pronunciation, distinguish similar words, or alter meaning. They play a central role in many writing systems, yet their impact on language technology has not been systematically quantified across scripts. While prior work has examined diacritics in individual languages, there is no cross-linguistic framework for measuring the degree to which writing systems rely on them and how this reliance affects downstream tasks. We propose a data-driven framework for quantifying diacritic complexity using corpus-level, information-theoretic metrics that capture the frequency, ambiguity, and structural diversity of character-diacritic combinations. We compute these metrics over 24 corpora in 15 languages, spanning both single- and multi-diacritic scripts. We then examine how diacritic complexity correlates with performance on the task of diacritic restoration, evaluating BERT- and RNN-based models. We find that, across languages, higher diacritic complexity is strongly associated with lower restoration accuracy. In single-diacritic scripts, where character-diacritic combinations are more predictable, frequency-based and structural measures largely align. In multi-diacritic scripts, however, structural complexity exhibits the strongest association with performance, surpassing frequency-based measures. These findings show that measurable properties of diacritic usage influence the performance of diacritic restoration models, demonstrating that orthographic complexity is not only descriptive but also functionally relevant for modeling.
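To make the idea of corpus-level diacritic metrics concrete, the following minimal sketch computes three illustrative quantities from raw text: the share of letters carrying at least one diacritic (a frequency measure), the conditional entropy of the diacritic set given the base letter (an ambiguity measure), and the number of distinct letter-diacritic combinations (a crude structural-diversity measure). The abstract does not define the metrics used, so these definitions, along with the hypothetical names `split_clusters` and `diacritic_stats`, are assumptions chosen for illustration only. The sketch also assumes that a diacritic is any Unicode combining mark exposed by NFD normalization, which misses letters such as "ø" or "đ" that do not decompose.

```python
import math
import unicodedata
from collections import Counter, defaultdict


def split_clusters(text):
    """Yield (base, marks) pairs from NFD-normalized text.

    `marks` is a tuple of the combining marks that follow `base`.
    Assumption: a diacritic is any character with a nonzero canonical
    combining class; marks with no preceding base are ignored.
    """
    base, marks = None, []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if base is not None:
                marks.append(ch)
        else:
            if base is not None:
                yield base, tuple(marks)
            base, marks = ch, []
    if base is not None:
        yield base, tuple(marks)


def diacritic_stats(text):
    """Return illustrative corpus-level diacritic statistics (not the paper's metrics)."""
    # counts[base][marks] = occurrences of `base` carrying exactly `marks`
    counts = defaultdict(Counter)
    for base, marks in split_clusters(text):
        if base.isalpha():
            counts[base.lower()][marks] += 1

    total = sum(sum(c.values()) for c in counts.values())
    if total == 0:
        return {"marked_ratio": 0.0, "cond_entropy_bits": 0.0, "distinct_combinations": 0}

    # Frequency: fraction of letters that carry at least one mark.
    marked = sum(n for c in counts.values() for m, n in c.items() if m)

    # Ambiguity: conditional entropy H(marks | base) in bits.
    cond_entropy = 0.0
    for c in counts.values():
        n_base = sum(c.values())
        for n in c.values():
            cond_entropy -= (n / total) * math.log2(n / n_base)

    # Structural diversity: distinct (base, non-empty mark tuple) pairs.
    distinct = len({(b, m) for b, c in counts.items() for m in c if m})

    return {
        "marked_ratio": marked / total,
        "cond_entropy_bits": cond_entropy,
        "distinct_combinations": distinct,
    }


if __name__ == "__main__":
    print(diacritic_stats("café cafe"))
```

In a multi-diacritic script such as Vietnamese, a single base letter can carry several stacked marks, so the mark tuple has more than one element and the number of distinct combinations grows quickly. This is one plausible reading of why a structural measure could diverge from a purely frequency-based one in such scripts.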
