Kashmiri, an Indo-Aryan language written in a modified Perso-Arabic script, is frequently written without diacritic marks in digital text, which creates ambiguity and complicates downstream NLP applications. We present Koshur Diacritizer, a ByT5-small byte-level sequence-to-sequence model for restoring diacritics in Kashmiri text. To support this task, we release a dataset of 23.7k aligned undiacritized-diacritized Kashmiri sentence pairs. The proposed framework combines script-aware normalization, alignment validation, and skeleton-preserving inference to ensure reliable restoration while maintaining the original base-letter sequence. On a held-out test set, the model achieves a DERm of 0.2012 and a word error rate (WER) of 0.2159. Additionally, evaluation by a native Kashmiri linguistic expert yields a mean accuracy of 77.5%. The dataset, model, and source code are publicly released to provide a reproducible baseline for Kashmiri diacritic restoration and future low-resource language research.
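To make the skeleton-preservation constraint concrete, the following is a minimal Python sketch, not the released implementation. It assumes that a diacritic is any Unicode nonspacing mark (category `Mn`) and that the skeleton is the text with those marks removed; the actual system may use a script-specific diacritic inventory. The function names `skeleton`, `is_skeleton_preserved`, and `safe_restore` are hypothetical and introduced only for illustration.

```python
import unicodedata


def skeleton(text: str) -> str:
    """Return the base-letter sequence of `text`.

    Assumption: every Unicode nonspacing mark (category "Mn") counts as a
    diacritic. A real system may use a narrower, script-specific list.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def is_skeleton_preserved(source: str, candidate: str) -> bool:
    """Check that `candidate` differs from `source` only in diacritics."""
    return skeleton(candidate) == skeleton(source)


def safe_restore(source: str, prediction: str) -> str:
    """Accept a model prediction only if it keeps the source skeleton.

    Falling back to the unchanged source is one possible policy, chosen here
    for simplicity; it is an assumption, not the paper's documented behavior.
    """
    return prediction if is_skeleton_preserved(source, prediction) else source
```

The same `is_skeleton_preserved` check could also serve as an alignment filter when building training pairs: a pair would be kept only if stripping the marks from the diacritized side reproduces the undiacritized side.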