This study evaluates the machine translation (MT) quality of two state-of-the-art large language models (LLMs) against a traditional neural machine translation (NMT) system across four language pairs in the legal domain. It combines automatic evaluation metrics (AEMs) and human evaluation (HE) by professional translators to assess translation ranking, fluency, and adequacy. The results indicate that while Google Translate generally outperforms LLMs on AEMs, human evaluators rate LLMs, especially GPT-4, comparably or slightly better at producing contextually adequate and fluent translations. This discrepancy suggests LLMs' potential in handling specialized legal terminology and context, highlighting the importance of human evaluation methods in assessing MT quality. The study underscores the evolving capabilities of LLMs in specialized domains and calls for a reevaluation of traditional AEMs to better capture the nuances of LLM-generated translations.