Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily focus on similarity rather than capturing meaningful improvements. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. We first conduct a manual annotation study to assess the quality of different revisions. Then, we investigate reference-free evaluation metrics from related NLP domains. Additionally, we examine LLM-as-a-judge approaches, analysing their ability to assess revisions with and without a gold reference. Our results show that LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. We find that a hybrid approach combining LLM-as-a-judge evaluation and task-specific metrics offers the most reliable assessment of revision quality.