In this work, we present a reverse transliteration model that converts romanized Malayalam to the native script using an encoder-decoder framework built on an attention-based bidirectional Long Short-Term Memory (Bi-LSTM) architecture. To train the model, we curated and combined a collection of 4.3 million transliteration pairs derived from two publicly available Indic-language transliteration datasets, Dakshina and Aksharantar. We evaluated the model on two test sets provided by the IndoNLP-2025-Shared-Task, containing (1) general typing patterns and (2) ad hoc typing patterns, respectively. On Test Set-1, the model achieved a character error rate (CER) of 7.4%. On Test Set-2, with ad hoc typing patterns in which most vowel indicators are missing, the CER rose to 22.7%.
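The reported metric, character error rate (CER), is conventionally the Levenshtein edit distance between a predicted string and its reference, normalized by the reference length. A minimal pure-Python sketch, assuming corpus-level aggregation (the shared task's official scorer may normalize per-word instead):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def cer(hypotheses, references) -> float:
    """Corpus-level CER: total edit distance over total reference characters."""
    edits = sum(levenshtein(h, r) for h, r in zip(hypotheses, references))
    chars = sum(len(r) for r in references)
    return edits / chars
```

For example, a prediction differing from a four-character reference by one substitution yields a CER of 0.25; a CER of 7.4% means roughly one character error per 13.5 reference characters.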