Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily focus on similarity rather than capturing meaningful improvements. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. We first conduct a manual annotation study to assess the quality of different revisions. Then, we investigate reference-free evaluation metrics from related NLP domains. Additionally, we examine LLM-as-a-judge approaches, analysing their ability to assess revisions with and without a gold reference. Our results show that LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. We find that a hybrid approach combining LLM-as-a-judge evaluation and task-specific metrics offers the most reliable assessment of revision quality.