In recent years, the non-deterministic properties of language models have attracted considerable attention and have been shown to significantly influence real-world applications. However, these properties remain under-explored in machine translation (MT), a complex, non-deterministic NLP task. In this study, we systematically evaluate modern MT systems and identify temperature-constrained Non-Deterministic MT (ND-MT) as a distinct phenomenon. We also demonstrate that ND-MT shows significant potential for addressing the multi-modality issue that has long challenged MT research, and that it provides higher-quality candidates than Deterministic MT (D-MT) under temperature constraints. ND-MT nevertheless introduces new challenges for evaluating system performance: the evaluation framework designed for D-MT fails to yield consistent results when applied to ND-MT. We investigate this challenge by evaluating five state-of-the-art ND-MT systems on three open datasets, using both lexical and semantic metrics at varying sampling sizes. The results reveal a Buckets effect across these systems: the lowest-quality candidate generated by an ND-MT system consistently determines the overall system ranking across sampling sizes for all reasonable metrics. Finally, we propose the ExpectoSample strategy to automatically assess the reliability of evaluation metrics for selecting robust ND-MT systems.
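To make the ranking question behind the Buckets effect concrete, the following is a minimal sketch, not the evaluation protocol used in the study. It assumes each system has already been scored by some metric, giving one list of candidate scores per source sentence. The system names, the scores, and the helpers `aggregate` and `rank_systems` are all hypothetical. The sketch compares how systems rank when each sentence is represented by its worst, mean, or best sampled candidate.

```python
from statistics import mean

def aggregate(sentence_scores, k, pick):
    """Reduce the first k candidate scores of every sentence with `pick`
    (e.g. min, mean, max), then average over sentences."""
    return mean(pick(cands[:k]) for cands in sentence_scores)

def rank_systems(system_scores, k, pick):
    """Return system names ordered from highest to lowest aggregate score."""
    agg = {name: aggregate(s, k, pick) for name, s in system_scores.items()}
    return sorted(agg, key=agg.get, reverse=True)

# Hypothetical metric scores: system -> sentences -> sampled candidates.
system_scores = {
    "sys_a": [[0.80, 0.79, 0.81], [0.81, 0.80, 0.82]],  # stable candidates
    "sys_b": [[0.90, 0.50, 0.85], [0.95, 0.60, 0.88]],  # high peak, low floor
}

for k in (1, 2, 3):  # sampling size
    for label, pick in (("worst", min), ("mean", mean), ("best", max)):
        print(k, label, rank_systems(system_scores, k, pick))
```

In this toy data the two systems swap places depending on which candidate represents each sentence, which is the kind of ranking inconsistency that a single-output evaluation framework cannot expose.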