Neural Machine Translation (NMT) systems rely heavily on explicit punctuation cues to resolve semantic ambiguities in a source sentence. User-generated sentences often contain missing or incorrect punctuation, and translating them yields output that is fluent but semantically wrong. This work highlights and addresses the problem of punctuation robustness in NMT systems, using English-to-Marathi translation as a case study. First, we introduce \textbf{\textit{Viram}}, a human-curated diagnostic benchmark of 54 punctuation-ambiguous English-Marathi sentence pairs for stress-testing existing NMT systems. Second, we evaluate two simple remediation strategies: a cascade-based \textit{restore-then-translate} approach and \textit{direct fine-tuning}. Our experimental results and analysis show that both strategies substantially improve NMT performance. Furthermore, we find that current Large Language Models (LLMs) are less robust than these task-specific strategies when translating such sentences, which calls for further research in this area. The code and dataset are available at https://github.com/KaustubhShejole/Viram_Marathi.