We introduce a dataset of commercial machine translations, collected weekly over six years across 12 translation directions. Because commercial providers routinely validate changes with human A/B testing, we assume their systems improve over time, which lets us evaluate machine translation (MT) metrics by their preference for more recent translations. Our study not only confirms several prior findings, such as the advantage of neural metrics over non-neural ones, but also examines the debated question of how MT quality affects metric reliability, an issue that the smaller datasets of previous work could not sufficiently explore. Overall, our results demonstrate the dataset's value as a testbed for metric evaluation. We release our code at https://github.com/gjwubyron/Evo