Evaluating the quality of generated text is a challenging task in natural language processing, owing to the inherent complexity and diversity of text. Recently, OpenAI's ChatGPT, a powerful large language model (LLM), has garnered significant attention for its impressive performance on a variety of tasks. In this report, we therefore investigate how effective LLMs, especially ChatGPT, are at assessing text quality and explore ways to optimize their use for this purpose. We compare three kinds of reference-free evaluation methods based on ChatGPT or similar LLMs. The experimental results show that ChatGPT can evaluate text quality effectively from various perspectives without references and that it outperforms most existing automatic metrics. In particular, the Explicit Score, which uses ChatGPT to generate a numeric score measuring text quality, is the most effective and reliable of the three approaches explored. However, directly comparing the quality of two texts using ChatGPT may lead to suboptimal results. We hope this report will provide valuable insights into selecting appropriate methods for evaluating text quality with LLMs such as ChatGPT.
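As a rough illustration of the Explicit Score idea described above (asking an LLM for a numeric quality score, with no reference text), the following minimal sketch may help. It is not the report's implementation: the prompt wording, the 1 to 5 scale, and the names `build_prompt`, `parse_score`, and `explicit_score` are all assumptions made for this example, and the model is represented by a hypothetical `llm` callable that maps a prompt string to a reply string rather than by any real API client.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt wording; the report's actual prompts are not shown here.
PROMPT_TEMPLATE = (
    "Score the following text for {aspect} on a scale from {low} to {high}. "
    "Reply with the number only.\n\nText: {text}\nScore:"
)


def build_prompt(text: str, aspect: str, low: int = 1, high: int = 5) -> str:
    """Fill the (assumed) template for one text and one quality aspect."""
    return PROMPT_TEMPLATE.format(aspect=aspect, low=low, high=high, text=text)


def parse_score(reply: str, low: int = 1, high: int = 5) -> Optional[float]:
    """Return the first number in the reply, or None if absent or out of range."""
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    value = float(match.group())
    if not low <= value <= high:
        return None
    return value


def explicit_score(
    text: str,
    aspect: str,
    llm: Callable[[str], str],  # hypothetical: any function from prompt to reply
    low: int = 1,
    high: int = 5,
) -> Optional[float]:
    """Ask the model for a numeric score and parse it from the reply."""
    reply = llm(build_prompt(text, aspect, low, high))
    return parse_score(reply, low, high)
```

In practice `llm` would wrap whatever chat model is being evaluated. Because `parse_score` takes the first number it finds, the prompt asks for the number only; a reply that restates the scale before giving the score would be misread, so a stricter output format may be preferable.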