Evaluating the quality of generated text is a challenging task in natural language processing. The difficulty arises from the inherent complexity and diversity of text. Recently, OpenAI's ChatGPT, a powerful large language model (LLM), has garnered significant attention for its impressive performance on a variety of tasks. In this report, we investigate how effective LLMs, especially ChatGPT, are at assessing text quality, and we explore ways to optimize their use for this purpose. We compare three kinds of reference-free evaluation methods based on ChatGPT or similar LLMs. The experimental results show that ChatGPT can evaluate text quality effectively from various perspectives without references, and that it outperforms most existing automatic metrics. In particular, the Explicit Score, which uses ChatGPT to generate a numeric score measuring text quality, is the most effective and reliable of the three approaches we explored. However, directly comparing the quality of two texts using ChatGPT may lead to suboptimal results. We hope this report will provide valuable insights into selecting appropriate methods for evaluating text quality with LLMs such as ChatGPT.
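The Explicit Score idea described in the abstract (asking the model for a numeric quality score, with no reference text) can be sketched as follows. This is a minimal illustration under stated assumptions, not the prompts or code used in the report: the prompt wording, the 1 to 5 scale, and the names `build_explicit_score_prompt`, `parse_score`, and `explicit_score` are all hypothetical, and the model is represented by any callable that maps a prompt string to a reply string.

```python
import re
from typing import Callable, Optional


def build_explicit_score_prompt(text: str, aspect: str,
                                low: int = 1, high: int = 5) -> str:
    """Build a reference-free scoring prompt (wording is an assumption)."""
    return (
        f"Score the following text for {aspect} on a scale from {low} to {high}, "
        f"where {low} is the worst and {high} is the best. "
        "Reply with the number only.\n\n"
        f"Text: {text}\nScore:"
    )


def parse_score(reply: str, low: int = 1, high: int = 5) -> Optional[float]:
    """Return the first number in the reply, or None if absent or out of range.

    Taking the first number is a simplification: a reply that restates the
    scale before giving the score would be parsed incorrectly.
    """
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    value = float(match.group())
    return value if low <= value <= high else None


def explicit_score(text: str, aspect: str,
                   llm: Callable[[str], str]) -> Optional[float]:
    """Ask the model for a numeric score and parse it.

    `llm` is a placeholder for whatever client sends a prompt to ChatGPT
    or a similar model and returns its text reply.
    """
    return parse_score(llm(build_explicit_score_prompt(text, aspect)))
```

With a real client, `llm` would wrap the API call; for a quick check, a stub such as `lambda prompt: "4"` is enough to exercise the prompt construction and parsing.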