Data curation is a critical yet under-researched step in the machine translation training paradigm. To train translation systems, data acquisition relies primarily on human translation and digital parallel sources, or, to a limited degree, on synthetic generation. For low-resource languages, however, generating sufficient data through human translation is prohibitively expensive. It is therefore crucial to develop a framework that screens source sentences to form an efficient parallel corpus, ensuring optimal MT system performance in low-resource environments. We approach this by evaluating English-Hindi bi-text to determine effective sentence selection strategies for optimal MT system training. Our extensively tested framework, LALITA (Lexical And Linguistically Informed Text Analysis), selects source sentences using lexical and linguistic features to curate parallel corpora. We find that training mostly on complex sentences, drawn from both existing and synthetic datasets, significantly improves translation quality. We test this by simulating low-resource data availability with curated datasets of 50K to 800K English sentences and report improved performance at all data sizes. LALITA demonstrates remarkable efficiency, reducing data needs by more than half across multiple languages (Hindi, Odia, Nepali, Norwegian Nynorsk, and German). This approach not only reduces MT training costs by lowering data requirements but also demonstrates LALITA's utility in data augmentation.
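To make the selection idea concrete, the minimal sketch below illustrates one way complexity-driven sentence selection could work; it is not the paper's actual method. The features used here (sentence length and rare-word ratio) and the helpers `complexity_score` and `select_complex` are assumptions standing in for LALITA's unspecified lexical and linguistic feature set.

```python
# Hypothetical sketch of complexity-based source-sentence selection.
# Sentence length and rare-word ratio are assumed proxies for "lexical
# and linguistic" complexity, used purely for illustration; the paper's
# real feature set is not reproduced here.
from collections import Counter

def complexity_score(sentence: str, vocab_counts: Counter, total_tokens: int) -> float:
    """Score a sentence by length plus the share of rare words (assumed proxies)."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    # Rare-word ratio: fraction of tokens whose corpus frequency is very low.
    rare = sum(1 for t in tokens if vocab_counts[t] / total_tokens < 1e-5)
    return len(tokens) + 10.0 * (rare / len(tokens))

def select_complex(sentences: list[str], budget: int) -> list[str]:
    """Keep the `budget` highest-scoring (most complex) source sentences."""
    counts = Counter(t for s in sentences for t in s.lower().split())
    total = sum(counts.values())
    ranked = sorted(sentences, key=lambda s: complexity_score(s, counts, total), reverse=True)
    return ranked[:budget]

if __name__ == "__main__":
    pool = ["The cat sat.",
            "Notwithstanding prior jurisprudence, the tribunal adjourned sine die."]
    # With a budget of 1, the longer, rarer-vocabulary sentence is retained.
    print(select_complex(pool, budget=1))
```

Under this sketch, the `budget` parameter plays the role of the simulated data sizes (e.g., 50K to 800K sentences): the pool is ranked once and truncated to the target size, keeping the most complex sentences.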