Despite notable advances in large language models (LLMs), reliable evaluation of text generation tasks such as text style transfer (TST) remains an open challenge. Existing research has shown that automatic metrics often correlate poorly with human judgments (Dementieva et al., 2024; Pauli et al., 2025), which limits our ability to assess model performance accurately. Furthermore, most prior work has focused on English, while the evaluation of multilingual TST systems, particularly for text detoxification, remains largely underexplored. In this paper, we present the first comprehensive multilingual benchmarking study of evaluation metrics for text detoxification across nine languages: Arabic, Amharic, Chinese, English, German, Hindi, Russian, Spanish, and Ukrainian. Drawing inspiration from machine translation evaluation, we compare neural automatic metrics with LLM-as-a-judge approaches and additionally experiment with task-specific fine-tuned models. Our analysis reveals that the proposed metrics achieve significantly higher correlation with human judgments than baseline approaches. We also provide actionable insights and practical guidelines for building robust and reliable multilingual evaluation pipelines for text detoxification and related TST tasks.
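As an illustration of what "correlation with human judgments" means in practice, the following is a minimal sketch of a rank correlation between metric scores and human ratings. The abstract does not state which correlation statistic the study uses, so Spearman's rho is an assumption here, and the function names and the example scores are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(metric_scores, human_scores):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))


if __name__ == "__main__":
    # Hypothetical scores for five detoxified outputs (not data from the paper).
    human = [1, 2, 3, 4, 5]
    metric = [0.20, 0.35, 0.30, 0.80, 0.90]
    print(f"Spearman rho: {spearman(metric, human):.3f}")
```

A value near 1 would indicate that the metric orders system outputs much as human annotators do, while a value near 0 would indicate little agreement. In a multilingual setting, such a coefficient would typically be computed separately per language.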