We release MTQE.en-he, to our knowledge the first publicly available English-Hebrew benchmark for Machine Translation Quality Estimation. MTQE.en-he contains 959 English segments from WMT24++, each paired with a machine translation into Hebrew and with Direct Assessment scores of translation quality annotated by three human experts. We benchmark ChatGPT prompting, TransQuest, and CometKiwi, and show that ensembling the three models outperforms the best single model (CometKiwi) by 6.4 percentage points in Pearson correlation and 5.6 percentage points in Spearman correlation. Fine-tuning experiments with TransQuest and CometKiwi reveal that full-model updates are prone to overfitting and distribution collapse, whereas parameter-efficient methods (LoRA, BitFit, and FTHead, i.e., fine-tuning only the classification head) train stably and yield improvements of 2-3 percentage points. MTQE.en-he and our experimental results enable future research on this under-resourced language pair.