In this paper, we describe the constrained MT systems submitted by Samsung R&D Institute Philippines to the WMT 2023 General Translation Task for two directions: en$\rightarrow$he and he$\rightarrow$en. Our systems comprise Transformer-based sequence-to-sequence models trained with a mix of best practices: comprehensive data preprocessing pipelines, synthetic backtranslated data, and noisy channel reranking during online decoding. On two public benchmarks, FLORES-200 and NTREX-128, our models perform comparably to, and sometimes outperform, strong unconstrained baseline systems such as mBART50 M2M and NLLB 200 MoE, despite having significantly fewer parameters.
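As a rough illustration of noisy channel reranking, the sketch below rescores candidate translations with the commonly used linear combination of a direct model score log P(y|x), a channel (reverse) model score log P(x|y), and a target-side language model score log P(y). The names `Candidate`, `noisy_channel_score`, and `rerank`, the weights, and all log-probabilities are hypothetical and chosen only for illustration; length normalization is omitted, and none of this should be read as the configuration used in the submitted systems.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One candidate translation with scores from three (assumed) models."""
    text: str
    direct_logprob: float   # log P(y | x): forward translation model
    channel_logprob: float  # log P(x | y): reverse (channel) translation model
    lm_logprob: float       # log P(y): target-side language model


def noisy_channel_score(c: Candidate,
                        channel_weight: float = 1.0,
                        lm_weight: float = 1.0) -> float:
    """Linear combination of the three log-probabilities.

    The weights are illustrative defaults; in practice they would be
    tuned on a development set.
    """
    return (c.direct_logprob
            + channel_weight * c.channel_logprob
            + lm_weight * c.lm_logprob)


def rerank(candidates: List[Candidate],
           channel_weight: float = 1.0,
           lm_weight: float = 1.0) -> List[Candidate]:
    """Return the candidates sorted from best to worst combined score."""
    return sorted(
        candidates,
        key=lambda c: noisy_channel_score(c, channel_weight, lm_weight),
        reverse=True,
    )


# Made-up scores for two hypothetical beam candidates.
beam = [
    Candidate("candidate A", direct_logprob=-1.0, channel_logprob=-5.0, lm_logprob=-4.0),
    Candidate("candidate B", direct_logprob=-1.5, channel_logprob=-2.0, lm_logprob=-3.0),
]
best = rerank(beam)[0]
```

With these made-up numbers, candidate A has the higher direct score, but candidate B wins once the channel and language model scores are added. That shift in ranking is the effect reranking is meant to capture.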
Source: Samsung Electronics