Evaluation of natural language generation (NLG) is complex and multi-dimensional. Generated text can be evaluated for fluency, coherence, factuality, or any other dimension of interest. Most frameworks that perform such multi-dimensional evaluation require training on large manually or synthetically generated datasets. In this paper, we study the efficacy of large language models as multi-dimensional evaluators using in-context learning, obviating the need for large training datasets. Our experiments show that in-context learning-based evaluators are competitive with learned evaluation frameworks for the task of text summarization, establishing state-of-the-art results on dimensions such as relevance and factual consistency. We then analyze how factors such as the selection and number of in-context examples affect performance. Finally, we study the efficacy of in-context learning-based evaluators in evaluating zero-shot summaries written by large language models such as GPT-3.
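As a rough illustration of what an in-context learning-based evaluator involves, the sketch below assembles a few-shot prompt that asks a language model to score a summary on a single dimension. The prompt layout, the `ScoredExample` type, and the `build_eval_prompt` helper are hypothetical and are not taken from the paper; the call to the language model itself is deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredExample:
    """One in-context example: a source text, a summary, and a human score.

    Hypothetical container, introduced only for this illustration.
    """
    source: str
    summary: str
    score: float


def build_eval_prompt(dimension: str,
                      examples: List[ScoredExample],
                      source: str,
                      summary: str) -> str:
    """Assemble a few-shot prompt that ends where the model should emit a score.

    The "Text / Summary / <dimension>" layout is an assumed format,
    not the template used by the authors.
    """
    blocks = []
    for ex in examples:
        # Each in-context example shows the model a complete, scored instance.
        blocks.append(
            f"Text: {ex.source}\nSummary: {ex.summary}\n{dimension}: {ex.score}"
        )
    # The test instance is left unscored so the model completes the score.
    blocks.append(f"Text: {source}\nSummary: {summary}\n{dimension}:")
    return "\n\n".join(blocks)


examples = [
    ScoredExample("The council approved the budget on Monday.",
                  "The budget was approved.", 4.5),
    ScoredExample("Heavy rain flooded several streets downtown.",
                  "A new park opened downtown.", 1.0),
]
prompt = build_eval_prompt(
    "Relevance",
    examples,
    source="The team won the final after extra time.",
    summary="The team won the final.",
)
print(prompt)
```

Changing which examples are passed in, and how many, corresponds to the selection and number of in-context examples whose effect the abstract says is analyzed. One prompt per dimension (for instance relevance or factual consistency) would give a multi-dimensional evaluation under this assumed setup.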