Subword tokens in Indian languages inherently carry meaning, and isolating them can improve NLP tasks, which makes subword segmentation a crucial step. Segmenting Sanskrit and other Indian languages into subtokens is not straightforward, because the text may include sandhi, which can change the word boundaries. We propose a new approach that uses a character-level Transformer model for Sanskrit word segmentation (CharSS). We perform experiments on three benchmark datasets to compare our method against existing methods. On the UoH+SandhiKosh dataset, our method outperforms the current state-of-the-art (SOTA) system by an absolute gain of 6.72 points in split prediction accuracy. On the hackathon dataset, it achieves a gain of 2.27 points over the current SOTA system on the perfect-match metric. We also propose a use case for Sanskrit-based segments: linguistically informed translation of technical terms into lexically similar low-resource Indian languages. In two separate experimental settings for this task, we achieve average improvements of 8.46 and 6.79 chrF++ points, respectively.
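To illustrate why sandhi makes segmentation harder than splitting on a delimiter, the following is a minimal Python sketch, not the CharSS model. It assumes IAST transliteration and covers only a handful of textbook vowel-sandhi fusions (a/ā + a/ā → ā, a + i → e, a + u → o). The names `SANDHI_RULES`, `join_with_sandhi`, and `candidate_splits` are hypothetical and chosen for this example only.

```python
import unicodedata

# Toy rule table (an assumption for illustration): a few vowel fusions in IAST.
# Real Sanskrit sandhi has many more rules, including consonant sandhi.
SANDHI_RULES = {
    ("a", "a"): "ā",
    ("a", "ā"): "ā",
    ("ā", "a"): "ā",
    ("ā", "ā"): "ā",
    ("a", "i"): "e",
    ("a", "u"): "o",
}


def join_with_sandhi(left: str, right: str) -> str:
    """Join two non-empty words, fusing the boundary vowels if a toy rule applies."""
    # Normalize so that a letter such as "ā" is a single code point.
    left = unicodedata.normalize("NFC", left)
    right = unicodedata.normalize("NFC", right)
    fused = SANDHI_RULES.get((left[-1], right[0]))
    if fused is None:
        return left + right  # no rule applies: plain concatenation
    return left[:-1] + fused + right[1:]


def candidate_splits(surface: str) -> list[tuple[str, str]]:
    """List every (left, right) pair that the toy rules could have fused into `surface`."""
    surface = unicodedata.normalize("NFC", surface)
    candidates = []
    for i in range(1, len(surface) - 1):
        for (left_vowel, right_vowel), fused in SANDHI_RULES.items():
            if surface[i] == fused:
                # Undo the fusion at position i by restoring both original vowels.
                candidates.append(
                    (surface[:i] + left_vowel, right_vowel + surface[i + 1:])
                )
    return candidates


if __name__ == "__main__":
    joined = join_with_sandhi("deva", "indra")
    print(joined)                    # the fused surface form
    print("indra" in joined)         # the second word is no longer a substring
    print(candidate_splits(joined))  # several candidates, only one is correct
```

Even with six rules, undoing the fusion over-generates: the surface form admits more than one candidate split, and nothing in the rule table says which is the intended one. Choosing among such candidates from context is the kind of decision a learned character-level model is meant to make.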