While automatic metrics drive progress in Machine Translation (MT) and Text Summarization (TS), existing metrics have been developed and validated almost exclusively for English and other high-resource languages. This narrow focus leaves Indian languages, spoken by over 1.5 billion people, largely overlooked and casts doubt on the universality of current evaluation practices. To address this gap, we introduce ITEM, a large-scale benchmark, enriched with fine-grained annotations, that systematically evaluates the alignment of 29 automatic metrics with human judgments across six major Indian languages. Our extensive evaluation, covering agreement with human judgments, sensitivity to outliers, language-specific reliability, inter-metric correlations, and resilience to controlled perturbations, reveals four central findings: (1) LLM-based evaluators show the strongest alignment with human judgments at both segment and system levels; (2) outliers exert a significant impact on metric-human agreement; (3) in TS, metrics are more effective at capturing content fidelity, whereas in MT, they better reflect fluency; and (4) metrics differ in their robustness and sensitivity when subjected to diverse perturbations. Collectively, these findings offer critical guidance for advancing metric design and evaluation in Indian languages.