We describe GEMBA, a GPT-based metric for assessing translation quality that works both with and without a reference translation. Our evaluation focuses on zero-shot prompting: we compare four prompt variants in two modes, depending on whether a reference is available. We investigate seven versions of GPT models, including ChatGPT, and show that our method for translation quality assessment works only with GPT 3.5 and larger models. Compared to results from the WMT22 Metrics shared task, our method achieves state-of-the-art accuracy in both modes when evaluated against MQM-based human labels. Our results hold at the system level for all three WMT22 Metrics shared task language pairs, namely English into German, English into Russian, and Chinese into English. This provides a first glimpse into the usefulness of pre-trained, generative large language models for quality assessment of translations. We publicly release all code and prompt templates used for the experiments described in this work, as well as all corresponding scoring results, to allow for external validation and reproducibility.
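To make the two modes concrete, the following is a minimal sketch of how a zero-shot scoring prompt might be assembled with or without a reference, and how a numeric score might be read back from the model's answer. The prompt wording, the 0 to 100 scale, and the helper names `build_prompt` and `parse_score` are illustrative assumptions, not the released prompt templates; the call to the GPT model itself is deliberately left out.

```python
import re
from typing import Optional


def build_prompt(source: str, hypothesis: str, src_lang: str, tgt_lang: str,
                 reference: Optional[str] = None) -> str:
    """Build a hypothetical zero-shot scoring prompt (not the paper's template).

    If `reference` is given, the prompt is reference-based; otherwise it is
    reference-free (quality estimation).
    """
    with_ref = reference is not None
    lines = [
        f"Score the following translation from {src_lang} to {tgt_lang}"
        + (" with respect to the human reference" if with_ref else "")
        + " on a scale from 0 to 100, where 0 means no meaning preserved"
        " and 100 means a perfect translation.",
        f'{src_lang} source: "{source}"',
    ]
    if with_ref:
        lines.append(f'{tgt_lang} human reference: "{reference}"')
    lines.append(f'{tgt_lang} translation: "{hypothesis}"')
    lines.append("Score:")
    return "\n".join(lines)


def parse_score(answer: str) -> Optional[float]:
    """Return the first number found in the answer if it lies in [0, 100]."""
    match = re.search(r"\d+(?:\.\d+)?", answer)
    if match is None:
        return None  # the model answered without any number
    value = float(match.group())
    return value if 0.0 <= value <= 100.0 else None
```

In this sketch, an answer that contains no number or an out-of-range number yields `None`, so a caller can decide whether to retry or discard the segment rather than silently recording an invalid score.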