Automatic evaluation of text generation is essential for improving the accuracy of generation tasks. Given the current trend towards ever-larger decoder-based language models, we investigate automatic evaluation methods for text generation that are built on such models. This paper compares several methods, including tuning of encoder-based models and of large language models, under equal conditions on two tasks, machine translation evaluation and semantic textual similarity, in two languages, Japanese and English. Experimental results show that tuned decoder-based models perform worse than tuned encoder-based models. Our analysis suggests that this is because the decoder-based models focus on surface word sequences and do not capture meaning. We also find that very large decoder-based models such as ChatGPT, when used through in-context learning, have difficulty identifying fine-grained semantic differences.