An increasing body of work has leveraged multilingual language models for Natural Language Generation tasks such as summarization. A major empirical bottleneck in this area is the shortage of accurate and robust evaluation metrics for many languages, which hinders progress. Recent studies suggest that multilingual language models often use English as an internal pivot language, and that misalignment with this pivot can degrade downstream performance. Motivated by the hypothesis that this mismatch could also affect multilingual neural metrics, we ask whether steering their activations toward an English pivot can improve correlation with human judgments. We experiment with encoder- and decoder-based metrics and find that test-time intervention methods are effective across the board, improving correlation with human judgments across diverse languages.
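To make the abstract's "test-time intervention" concrete, the sketch below shows one common form of activation steering: a mean-difference vector pointing from target-language toward English representations is added to one encoder layer's hidden states at inference time, with no weight updates. This is a minimal illustration under assumed choices; the model name (`xlm-roberta-base`), layer index, steering strength, and example sentences are all hypothetical and are not the paper's actual setup.

```python
# Minimal sketch of test-time activation steering toward an English "pivot".
# All concrete choices (model, layer, alpha, sentences) are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "xlm-roberta-base"  # hypothetical multilingual encoder
LAYER = 8                        # hypothetical mid-depth layer to intervene at
ALPHA = 4.0                      # hypothetical steering strength

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

def mean_hidden_state(texts, layer=LAYER):
    """Mean-pooled hidden state at `layer`, averaged over tokens and examples."""
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=(0, 1))

# Steering vector: direction from target-language to English representations,
# estimated here from a tiny set of parallel sentences (placeholders).
english = ["The cat sits on the mat.", "It is raining today."]
target = ["Die Katze sitzt auf der Matte.", "Es regnet heute."]
steer = mean_hidden_state(english) - mean_hidden_state(target)

def add_steering(module, inputs, output):
    # Encoder layers return a tuple; shift the hidden states toward English.
    return (output[0] + ALPHA * steer,) + output[1:]

# Intervene only at test time via a forward hook; model weights are untouched.
handle = model.encoder.layer[LAYER].register_forward_hook(add_steering)
enc = tokenizer(["Es regnet heute."], return_tensors="pt")
with torch.no_grad():
    steered_states = model(**enc).last_hidden_state
handle.remove()
```

A metric built on top of such an encoder would then score with `steered_states` in place of the unsteered representations; how the steering vector is estimated and where it is injected are the main design choices such methods vary.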