We investigate the potential of ChatGPT as a multidimensional evaluator for the task of \emph{Text Style Transfer}, alongside and in comparison to existing automatic metrics and human judgments. We focus on a zero-shot setting, i.e., prompting ChatGPT with specific task instructions, and test its performance on three commonly used dimensions of text style transfer evaluation: style strength, content preservation, and fluency. We perform a comprehensive correlation analysis at different levels, both for each of the two transfer directions and overall. Compared to existing automatic metrics, ChatGPT achieves competitive correlations with human judgments. These preliminary results offer a first glimpse into the role of large language models in the multidimensional evaluation of stylized text generation.